Part 3 of 3: LLM Fundamentals Series
Updated February 2026 - Fine-tuning transforms generic models into specialized experts. This guide covers everything from deciding IF you should fine-tune, to HOW to do it effectively, with real-world examples and costs.
Hot take: Most people fine-tune when they shouldn't. But when you SHOULD fine-tune, the results can be game-changing.
Table of Contents
- What is Fine-Tuning?
- When to Fine-Tune (and When NOT To)
- Fine-Tuning Methods
- Which Models Can Be Fine-Tuned?
- Real-World Examples
- Step-by-Step Implementation
- Costs & ROI
What is Fine-Tuning? 🤔
The Simple Explanation
For a 5-year-old: Imagine you have a really smart friend who knows about EVERYTHING - space, dinosaurs, cooking, math. But you need them to become a SUPER EXPERT at just dinosaurs. Fine-tuning is like giving them a special dinosaur course so they become the BEST at dinosaurs!
For developers: Fine-tuning is the process of taking a pre-trained model and continuing its training on a specialized dataset to adapt it for specific tasks or domains.
Why Not Train from Scratch?
Training from scratch:
- Cost: $5M-$100M+
- Time: Months
- Data needed: Billions of tokens
- Expertise: Research-level ML team
- Result: General-purpose model
Fine-tuning existing model:
- Cost: $10-$10,000
- Time: Hours to days
- Data needed: 100s-10,000s examples
- Expertise: ML engineer
- Result: Specialized model
The Chef Analogy
Pre-trained model = Experienced chef
↓
Knows all cooking techniques
Can make any dish reasonably well
Fine-tuning = Teaching your recipes
↓
Same chef, now expert at YOUR restaurant
Knows YOUR style, YOUR ingredients
Consistent with YOUR brand
From scratch = Culinary school from age 5
↓
Start with nothing
Learn everything
Takes years
When to Fine-Tune (and When NOT To) 🎯
✅ You SHOULD Fine-Tune When:
1. Specialized Domain Knowledge
Example: Medical diagnosis
❌ GPT-5: "That rash might be eczema or psoriasis"
✅ Fine-tuned: "Based on morphology and distribution pattern,
differential diagnosis includes:
1. Psoriasis vulgaris (most likely - 85%)
2. Seborrheic dermatitis (12%)
3. Drug eruption (3%)
Recommend: biopsy for confirmation"
Why: Medical terminology, diagnostic reasoning, treatment protocols
2. Consistent Format/Style
Example: Legal document generation
❌ GPT-5: Creates documents with varying structure
✅ Fine-tuned: Always follows exact company template
Uses specific clause language
Maintains consistent formatting
Why: You need EXACT formatting every time
3. Private/Proprietary Knowledge
Example: Company-specific customer support
❌ GPT-5: Doesn't know your products/policies
✅ Fine-tuned: Expert on YOUR product line
Knows YOUR return policies
Uses YOUR brand voice
Why: Information not in public training data
4. Performance on Specific Task
Example: SQL query generation from natural language
❌ GPT-5: 70% correct queries
✅ Fine-tuned: 95% correct queries
Why: Learns your database schema, naming conventions
5. Cost at Scale
Example: 1M requests/month for sentiment analysis
❌ GPT-5 API: $1,250/month
✅ Fine-tuned GPT-3.5: $200/month (6x cheaper!)
Why: Smaller models can match performance on narrow tasks
❌ You Should NOT Fine-Tune When:
1. Prompt Engineering Can Solve It
❌ Don't fine-tune: "Make responses more concise"
✅ Instead: Add to system prompt:
"Keep responses under 3 sentences. Be direct."
Cost: Free vs $500+ for fine-tuning
2. RAG (Retrieval-Augmented Generation) is Better
❌ Don't fine-tune: Adding company knowledge base
✅ Instead: Use RAG
- Embed documents
- Retrieve relevant context
- Pass to model in prompt
Why: More flexible, easier to update, no retraining
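The RAG flow above (embed, retrieve, pass context to the model) can be sketched in a few lines. This toy version uses naive word-overlap "embeddings" so it runs anywhere; a real system would use a proper embedding model and a vector store, and the knowledge-base sentences here are made up for illustration.

```python
def embed(text: str) -> set[str]:
    """Toy 'embedding': the set of lowercased words in the text."""
    return set(text.lower().split())

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document with the largest word overlap with the query."""
    q = embed(query)
    return max(docs, key=lambda d: len(q & embed(d)))

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble the context-augmented prompt sent to the base model."""
    context = retrieve(query, docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

kb = [
    "Refunds are issued within 14 days of purchase.",
    "Premium plans include priority support and SSO.",
]
prompt = build_prompt("How long do refunds take?", kb)
```

Updating the knowledge base is just editing `kb` - no retraining, which is exactly why RAG wins for fast-changing knowledge.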
3. You Have <100 Quality Examples
❌ Don't fine-tune: 50 examples
✅ Instead: Use few-shot learning in prompt
Show 3-5 examples in each request
Why: Insufficient data = overfitting
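Few-shot prompting means shipping a handful of worked examples inside every request instead of training on them. A minimal sketch of building such a request (the sentiment task and example texts are illustrative, not from the article):

```python
def build_few_shot_messages(examples, query):
    """Build a chat-style message list with worked examples before the real query."""
    messages = [{"role": "system",
                 "content": "Classify sentiment as positive or negative."}]
    for text, label in examples:
        # Each shot is a user turn followed by the desired assistant answer
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages

shots = [
    ("Great product!", "positive"),
    ("Broke in a day.", "negative"),
    ("Love it.", "positive"),
]
msgs = build_few_shot_messages(shots, "Terrible support.")
```

The resulting `msgs` list can be passed directly as the `messages` argument of a chat completion call.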
4. Task Changes Frequently
❌ Don't fine-tune: News categorization (topics change)
✅ Instead: Use general model + updated prompt
Why: Fine-tuning is expensive to repeat
5. Model Already Excels
❌ Don't fine-tune: GPT-5 already 95% accurate
✅ Instead: Use it as-is
Why: Marginal gains not worth effort
Decision Tree
Start: Do I need custom behavior?
│
No → Use base model
│
Yes → Can prompt engineering solve it?
│
No → Is it knowledge-based?
│
Yes → Use RAG
│
No → Do I have 500+ quality examples?
│
No → More data collection OR few-shot learning
│
Yes → Do I need <100ms latency OR >1M requests/month?
│
No → Maybe stick with API
│
Yes → FINE-TUNE! 🎯
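The decision tree above can be encoded as a small function; the thresholds (500 examples, 100 ms, 1M requests/month) are taken straight from the diagram, and the return strings are illustrative:

```python
def should_fine_tune(needs_custom_behavior, prompting_solves_it, knowledge_based,
                     num_examples, needs_low_latency, monthly_requests):
    """Walk the decision tree from the article, top to bottom."""
    if not needs_custom_behavior:
        return "use base model"
    if prompting_solves_it:
        return "use prompt engineering"
    if knowledge_based:
        return "use RAG"
    if num_examples < 500:
        return "collect more data or use few-shot"
    if needs_low_latency or monthly_requests > 1_000_000:
        return "fine-tune"
    return "maybe stick with API"
```

For example, a custom task with 2,000 examples and 5M requests/month lands on "fine-tune", while the same task at 10K requests/month lands on "maybe stick with API".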
Fine-Tuning Methods 🛠️
1. Full Fine-Tuning (Traditional)
What it is:
Update ALL model parameters during training.
How it works:
# Conceptual example
model = load_pretrained_model("llama-4-70b")

# Update EVERY weight in all layers
for batch in training_data:
    loss = model.forward(batch)
    loss.backward()   # Gradients for ALL 70B parameters
    optimizer.step()  # Update ALL parameters
Pros:
- ✅ Maximum performance gains
- ✅ Complete adaptation
- ✅ Can dramatically change behavior
Cons:
- ❌ Requires full model in memory (100+ GB GPU RAM)
- ❌ Expensive (1000s of GPU hours)
- ❌ Risk of catastrophic forgetting
- ❌ Slow
When to use:
- You have massive GPU resources
- Need maximum quality
- Have 10,000+ diverse examples
- Can tolerate forgetting general knowledge
Real cost:
LLaMA 4 Maverick (400B):
GPU: 8x H100 (80GB each)
Time: 24-72 hours
Cost: $5,000-$15,000
LLaMA 3.1 70B:
GPU: 4x A100 (80GB each)
Time: 8-24 hours
Cost: $800-$2,400
2. LoRA (Low-Rank Adaptation) 🌟 MOST POPULAR
What it is:
Freeze original model weights, train small "adapter" matrices.
Simple explanation:
Imagine the model is a huge library. Instead of rewriting books, you add sticky notes with updates!
How it works:
# Original model: 70B parameters
base_model = load_model("llama-4-70b")  # FROZEN

# Add small trainable matrices
lora_A = nn.Linear(4096, 8)  # Very small!
lora_B = nn.Linear(8, 4096)  # Very small!

# During inference:
output = base_model(x) + lora_B(lora_A(x))
#              ↑                ↑
#           frozen      trainable (~0.1% of params)
The math (simplified):
Original layer: W (4096 × 4096) = 16M parameters
LoRA decomposition:
W_new = W + B @ A
Where:
W: 4096 × 4096 (frozen)
B: 4096 × 8 (trainable)
A: 8 × 4096 (trainable)
Total trainable: 4096×8 + 8×4096 = 65K parameters
Reduction: 16M → 65K = 99.6% fewer parameters!
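The parameter arithmetic above checks out in a couple of lines:

```python
d, r = 4096, 8
full_params = d * d           # frozen weight W: 4096 × 4096
lora_params = d * r + r * d   # trainable B (d×r) and A (r×d)
reduction = 1 - lora_params / full_params

print(full_params)         # 16777216  (~16M)
print(lora_params)         # 65536     (~65K)
print(f"{reduction:.1%}")  # 99.6%
```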
Pros:
- ✅ 10-100x less memory needed
- ✅ 10-100x faster training
- ✅ 10-100x cheaper
- ✅ No catastrophic forgetting
- ✅ Can merge adapters OR swap them
- ✅ Share base model, just save adapters (10-100 MB vs 100+ GB)
Cons:
- ❌ Slightly lower performance than full fine-tuning
- ❌ Not suitable for completely new domains
When to use:
- Most practical use cases (this is the default choice!)
- Limited GPU budget
- Want to fine-tune multiple tasks (swap adapters)
- Need to iterate quickly
Real cost:
LLaMA 4 Maverick (400B) with LoRA:
GPU: 1x H100 (80GB)
Time: 2-6 hours
Cost: $50-$200
LLaMA 3.1 70B with LoRA:
GPU: 1x A100 (80GB)
Time: 1-3 hours
Cost: $20-$80
GPT-3.5 via OpenAI API:
Cost: ~$8 per 1M training tokens
3. QLoRA (Quantized LoRA)
What it is:
LoRA + quantization = Fine-tune on consumer hardware!
How it works:
# Load model in 4-bit precision
model = load_in_4bit("llama-3.1-70b")
# Memory: 70B × 4 bits = 35 GB (vs 140 GB for 16-bit!)

# Add LoRA adapters (still 16-bit for training)
lora_adapters = create_lora_adapters(rank=8)

# Train only adapters
for batch in data:
    # Forward pass through 4-bit model
    loss = model(batch)
    # Backprop only through adapters
    loss.backward()
Quantization explained:
16-bit (standard):
Number: 3.14159
Precision: Very high
Memory: 2 bytes
4-bit (quantized):
Number: ~3.125 (rounded)
Precision: Lower, but often good enough
Memory: 0.5 bytes (4x reduction!)
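A toy illustration of the 4-bit idea above: map each weight onto one of 16 evenly spaced levels between the tensor's min and max. Real 4-bit schemes (e.g. the NF4 format used by QLoRA) use a smarter, non-uniform grid and per-block scaling, so treat this only as a sketch of the rounding trade-off:

```python
def quantize_4bit(values):
    """Uniform 4-bit quantization: 16 levels between min and max."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 15  # 16 levels -> 15 intervals
    codes = [round((v - lo) / step) for v in values]  # integers 0..15, fit in 4 bits
    dequant = [lo + c * step for c in codes]          # approximate reconstruction
    return codes, dequant

weights = [0.0, 0.8, 1.6, 3.0]
codes, approx = quantize_4bit(weights)
```

Each original 16-bit value is stored as a 4-bit code; dequantization recovers a nearby (not exact) value, which is the "lower precision, often good enough" trade-off described above.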
Pros:
- ✅ Can fine-tune 70B model on 1x RTX 4090 (24GB)!
- ✅ Cheapest option
- ✅ Accessible to hobbyists
- ✅ Surprisingly good quality
Cons:
- ❌ Slight quality degradation
- ❌ More complex setup
When to use:
- Limited hardware (single consumer GPU)
- Budget is top priority
- Prototyping/research
Real cost:
LLaMA 3.1 70B with QLoRA:
GPU: 1x RTX 4090 (24GB)
Time: 4-12 hours
Cost: $0-40 (if you own GPU)
4. P-Tuning / Prefix Tuning / Prompt Tuning
What it is:
Don't tune model at all - just tune soft prompts!
How it works:
# Instead of modifying weights...
# Add learnable "virtual tokens" to input
learnable_prefix = nn.Parameter(torch.randn(10, 4096))
# 10 tokens × 4096 dimensions = 40K parameters only!
# Prepend to all inputs
input_with_prefix = torch.cat([learnable_prefix, user_input])
output = frozen_model(input_with_prefix)
Simple explanation:
Like having a magic phrase that makes the model behave exactly right, but the phrase is learned numbers instead of words!
Pros:
- ✅ Tiny - only 10-100K parameters
- ✅ Super fast to train
- ✅ Multiple "prompts" for one model
Cons:
- ❌ Limited expressive power
- ❌ Only good for simpler adaptations
When to use:
- Very simple style/format changes
- Extreme resource constraints
- Exploring feasibility
Method Comparison Table
Method | Params | Memory | Time | Cost | Quality | Use When
--------------|--------|--------|------|-------|---------|----------
Full FT | 100% | 100GB+ | 24h | $5K+ | ★★★★★ | Max quality needed
LoRA | 0.1% | 40GB | 3h | $100 | ★★★★☆ | Default choice
QLoRA | 0.1% | 20GB | 8h | $30 | ★★★☆☆ | Limited hardware
Prompt Tuning | 0.001% | 40GB | 1h | $20 | ★★☆☆☆ | Simple changes
Which Models Can Be Fine-Tuned? 🔧
Proprietary Models (API-Based)
OpenAI:
Fine-tunable models (Feb 2026):
- GPT-4o-mini ✅
- GPT-3.5-turbo ✅
- GPT-4o (limited access) ✅
Method: API-based (upload data, they train)
Cost: $8-25 per 1M training tokens
Time: 10 minutes - 2 hours
Data format: JSONL with prompt-completion pairs
Anthropic (via AWS Bedrock):
Fine-tunable models:
- Claude 3 Haiku ✅
- Claude 3.5 Haiku ✅
Method: AWS Bedrock integration
Cost: ~$10-30 per 1M tokens
Time: 1-4 hours
Google:
Fine-tunable models:
- Gemini 1.5 Flash ✅
- Gemini 1.5 Pro (limited) ✅
Method: Vertex AI
Cost: ~$7-20 per 1M tokens
Cohere:
All models fine-tunable ✅
Method: API
Cost: ~$10 per 1M tokens
Open-Source Models (Self-Hosted)
Meta LLaMA Family:
LLaMA 4 Scout (70B): ✅ Full access
LLaMA 4 Maverick (400B): ✅ Full access
LLaMA 3.3 (70B): ✅ Full access
LLaMA 3.1 (8B, 70B, 405B): ✅ Full access
License: Llama 4 Community License
- Free for research
- Free for commercial use
- Can distribute fine-tuned models
Methods supported:
- Full fine-tuning ✅
- LoRA ✅
- QLoRA ✅
- DPO (preference tuning) ✅
DeepSeek:
DeepSeek-V3 (671B): ✅ Full access
DeepSeek-R1 (671B): ✅ Full access
DeepSeek-Coder (1-236B): ✅ Full access
License: DeepSeek License (MIT-like)
- Free for all use
- Can modify and distribute
Best for: Coding, reasoning tasks
Mistral:
Mistral Medium 3.1: ✅
Mixtral 8x7B: ✅
Mixtral 8x22B: ✅
License: Apache 2.0
- Truly open source
- No restrictions
Best for: European users, cost efficiency
Qwen (Alibaba):
Qwen 3 (235B-A22B): ✅
Qwen 2.5 (all sizes): ✅
License: Qwen License
- Free for research and commercial
Best for: Multilingual, especially Chinese
Comparison Matrix:
Model Family | Open? | License | Best For | FT Difficulty
---------------|-------|------------|------------------------|---------------
GPT-3.5/4o | No | Proprietary| Easy API integration | Easy
Claude | No | Proprietary| Safety, helpfulness | Medium
Gemini | No | Proprietary| Multimodal | Medium
LLaMA 4 | Yes | Permissive | General purpose | Easy-Med
DeepSeek | Yes | MIT-like | Code, reasoning | Medium
Mistral | Yes | Apache 2.0 | EU compliance | Easy
Qwen | Yes | Permissive | Multilingual | Easy
Real-World Fine-Tuning Examples 📚
Example 1: Customer Support Chatbot
Company: SaaS startup (project management tool)
Problem:
Generic GPT-4:
User: "How do I share a board?"
GPT-4: "To share a board, typically you'd look for a 'Share'
button in your application..."
❌ Generic, doesn't mention their specific UI
❌ Doesn't use company terminology
Solution: Fine-tuned GPT-3.5
Data collected:
500 support ticket examples:
{
"prompt": "How do I share a board?",
"completion": "Click the '👥 Collaborators' button in the top-right
of any board. Then click 'Invite Members' and enter
their email. They'll receive an invite link immediately!"
}
{
"prompt": "Can I export to Excel?",
"completion": "Yes! Go to Board Settings > Export > 'Download as Excel'.
Premium plans can also schedule automated exports."
}
... 498 more examples
Training:
# Using the OpenAI Python client
from openai import OpenAI

client = OpenAI()
upload = client.files.create(
    file=open("support_qa_train.jsonl", "rb"),
    purpose="fine-tune"
)
client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-3.5-turbo",
    suffix="support-v1"
)
Cost: ~$12 (500 examples × ~1,000 tokens avg × 3 epochs × $8/1M)
Time: 18 minutes
Results:
Metric | Before | After | Improvement
------------------------|--------|--------|-------------
Answer accuracy | 65% | 94% | +45%
Uses company terms | 10% | 98% | +880%
Mentions correct UI | 20% | 96% | +380%
Customer satisfaction | 3.2/5 | 4.7/5 | +47%
Cost savings:
- Fewer escalations: -35% → $8K/month saved
- Faster resolution: -40% time → 2 FTE saved
Total ROI: ~$25K/month from $12 fine-tune!
Example 2: Medical Diagnosis Assistant
Organization: Telemedicine platform
Problem:
Generic model hallucinates medical advice
Can't differentiate severity
Doesn't follow medical reasoning protocols
Solution: Fine-tuned LLaMA 4 Scout (70B) with LoRA
Data collected:
10,000 anonymized case notes from dermatologists:
Format:
{
"input": "Patient: 35F, presents with raised, scaly patches on elbows
and knees, silvery appearance, no pain, worsens in winter",
"output": "<reasoning>
Presentation suggests psoriasis vulgaris:
- Symmetrical distribution (elbows/knees)
- Silvery scales (characteristic)
- Koebner phenomenon possible
- Seasonal variation (common in psoriasis)
Differential diagnosis:
1. Psoriasis vulgaris (90% confidence)
2. Eczema (5%)
3. Fungal infection (5%)
</reasoning>
<recommendation>
- Confirm with skin biopsy
- Start topical corticosteroid
- Refer to dermatology if no improvement in 2 weeks
- Avoid triggers (stress, dry air)
</recommendation>"
}
Training setup:
# Using HuggingFace + PEFT (LoRA)
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-70B",
    load_in_8bit=True,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Train (Trainer construction omitted here for brevity)
trainer.train()
Hardware: 2x A100 (80GB)
Time: 12 hours
Cost: ~$300 (cloud GPU rental)
Results:
Metric | Base LLaMA | Fine-tuned | Specialist
----------------------------|------------|------------|-------------
Correct diagnosis (top-1) | 62% | 89% | 92%
Correct diagnosis (top-3) | 78% | 96% | 97%
Follows medical reasoning | 45% | 94% | 98%
Hallucinates treatments | 18% | 2% | 0.5%
Appropriate urgency level | 70% | 95% | 97%
Impact:
- Reduced misdiagnosis: 27% → 4%
- Faster triage: -60% time
- Doctor confidence in AI assist: 85%
Example 3: SQL Query Generation
Company: Business intelligence platform
Problem:
User: "Show me top 10 customers by revenue last quarter"
Generic GPT-4:
SELECT customer_name, revenue
FROM customers
WHERE date >= DATE_SUB(NOW(), INTERVAL 3 MONTH)
ORDER BY revenue DESC
LIMIT 10;
❌ Wrong table names (they use 'clients', not 'customers')
❌ Wrong column names
❌ Doesn't use their data warehouse conventions
Solution: Fine-tuned CodeLlama 34B
Data collected:
2,500 natural language → SQL pairs from their schema:
{
"instruction": "Show me top 10 customers by revenue last quarter",
"input": "Schema: clients (client_id, name, industry),
orders (order_id, client_id, amount, order_date)",
"output": "SELECT
c.name,
c.industry,
SUM(o.amount) as total_revenue
FROM clients c
JOIN orders o ON c.client_id = o.client_id
WHERE o.order_date >= DATE_TRUNC('quarter',
CURRENT_DATE - INTERVAL '1 quarter')
AND o.order_date < DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY c.client_id, c.name, c.industry
ORDER BY total_revenue DESC
LIMIT 10;"
}
Training:
# Using Axolotl framework
accelerate launch -m axolotl.cli.train config.yml
Config highlights:
base_model: codellama/CodeLlama-34b-hf
adapter: qlora
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2e-4
Hardware: 1x A100 (80GB)
Time: 8 hours
Cost: $200
Results:
Metric | Base Model | Fine-tuned
--------------------------|------------|------------
Syntactically correct SQL | 82% | 98%
Uses correct table names | 45% | 99%
Uses correct columns | 51% | 97%
Follows conventions | 30% | 95%
Executable queries | 68% | 94%
Business impact:
- Queries generated correctly: 68% → 94%
- Time to insights: -70% (analysts)
- Self-service adoption: +300%
Example 4: Legal Contract Analysis
Firm: Corporate law firm
Problem:
Need to extract specific clauses from 1000s of contracts
Generic models miss nuanced legal language
Solution: Fine-tuned GPT-4o-mini
Data collected:
1,000 annotated contracts with clause extractions:
{
"prompt": "Extract termination clauses from this contract:\n\n[contract text]",
"completion": "Termination Clauses Found:\n\n
1. Section 12.3 - Either party may terminate with 90 days
written notice\n
2. Section 12.4 - Immediate termination for material breach\n
3. Section 12.5 - Automatic termination if Company acquired\n\n
Key Terms:\n
- Notice period: 90 days
- Cure period: 30 days (breach)
- Survival clauses: Sections 8, 11, 13"
}
Training:
# OpenAI API
from openai import OpenAI

client = OpenAI()
client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 4,
        "batch_size": 1,
        "learning_rate_multiplier": 0.3
    }
)
Cost: $150 (1000 examples, average 2K tokens each)
Time: 2.5 hours
Results:
Metric | GPT-4o-mini | Fine-tuned | Lawyer
------------------------------|-------------|------------|--------
Finds all relevant clauses | 73% | 96% | 98%
Correctly interprets terms | 68% | 92% | 95%
Misses critical clauses | 15% | 2% | 0.5%
Extraction time (per contract)| 3 min | 1 min | 25 min
ROI:
- Process 50 contracts/day (was 8/day)
- 6.25x productivity increase
- Cost: $150 one-time + $0.15/contract inference
- Savings: $180K/year in paralegal time
Step-by-Step Implementation Guide 🚀
Phase 1: Preparation (Week 1)
Step 1: Define Success Metrics
BEFORE starting:
✅ What does success look like?
✅ How will you measure it?
✅ What's the baseline performance?
Example:
Metric: SQL query correctness
Current: 68% executable queries
Target: >90% executable queries
Measurement: Test set of 200 queries
Step 2: Collect/Create Dataset
Data requirements:
Minimum viable:
- Classification: 100-500 examples
- Generation: 500-2,000 examples
- Complex reasoning: 2,000-10,000 examples
Quality > Quantity:
1 great example > 10 mediocre examples
Data format (most common):
[
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is photosynthesis?"},
{"role": "assistant", "content": "Photosynthesis is..."}
]
},
{
"messages": [...]
}
]
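Training files are usually JSONL: one JSON object per line, each in the chat format shown above. A quick sketch of writing and re-reading that format:

```python
import json

examples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is photosynthesis?"},
        {"role": "assistant", "content": "Photosynthesis is..."},
    ]},
]

# Write: one JSON object per line
with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read back: parse each line independently
with open("training_data.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

Note the outer structure is NOT a JSON array - each line must parse on its own, which is what the fine-tuning endpoints expect.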
Step 3: Clean and Validate Data
def validate_dataset(data):
    """Check for common issues"""
    issues = []
    for i, example in enumerate(data):
        # Check format
        if "messages" not in example:
            issues.append(f"Example {i}: Missing 'messages' key")
            continue
        # Check length
        messages = example.get("messages", [])
        total_tokens = estimate_tokens(messages)
        if total_tokens > 4000:
            issues.append(f"Example {i}: Too long ({total_tokens} tokens)")
        # Check quality
        if messages and len(messages[-1]["content"]) < 10:
            issues.append(f"Example {i}: Response too short")
    return issues

# Fix common issues
def clean_dataset(data):
    cleaned = []
    for example in data:
        # Remove empty responses
        if len(example["messages"][-1]["content"]) < 10:
            continue
        # Truncate if too long
        truncated = truncate_to_length(example, max_tokens=4000)
        cleaned.append(truncated)
    return cleaned
Step 4: Split Data
Training set: 80% (800 examples)
Validation set: 10% (100 examples)
Test set: 10% (100 examples)
Important: Keep test set COMPLETELY separate
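The 80/10/10 split above, as code. Shuffling before splitting avoids ordering bias, and a fixed seed makes the split reproducible so the test set stays the same across runs:

```python
import random

def split_dataset(data, seed=42):
    """Shuffle and split into 80% train / 10% validation / 10% test."""
    data = data[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)
    n = len(data)
    train = data[: int(n * 0.8)]
    val = data[int(n * 0.8): int(n * 0.9)]
    test = data[int(n * 0.9):]  # held out - never look at it during training
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
```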
Phase 2: Training (Week 2)
Option A: Using OpenAI API (Easiest)
import time

from openai import OpenAI

client = OpenAI()

# 1. Upload training file
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )

# 2. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 0.5
    }
)

# 3. Monitor progress
while True:
    job_status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {job_status.status}")
    if job_status.status == "succeeded":
        model_name = job_status.fine_tuned_model
        print(f"Model ready: {model_name}")
        break
    time.sleep(60)

# 4. Use fine-tuned model
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "test query"}]
)
Option B: Using HuggingFace + LoRA (Most Control)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# 1. Load model and tokenizer
model_name = "meta-llama/Llama-3.1-70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,                                 # rank - higher = more capacity
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 8.4M || all params: 70.6B || trainable%: 0.01%

# 3. Load and preprocess data
dataset = load_dataset("json", data_files="training_data.jsonl")

def preprocess(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048
    )

tokenized_dataset = dataset.map(preprocess, batched=True)

# 4. Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
)

# 5. Create trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)
trainer.train()

# 6. Save adapter
model.save_pretrained("./my-fine-tuned-model")
Option C: Using Axolotl (Best for QLoRA)
# config.yml
base_model: NousResearch/Llama-3.1-70B
# LoRA config
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
# Dataset
datasets:
- path: training_data.jsonl
type: alpaca
# Training
num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 0.0002
warmup_steps: 100
# Quantization
load_in_4bit: true
# Run training
accelerate launch -m axolotl.cli.train config.yml
# Merge adapter (optional)
python -m axolotl.cli.merge_lora config.yml --lora_model_dir="./output"
Phase 3: Evaluation (Week 3)
Quantitative Evaluation:
def evaluate_model(model, test_set):
    results = {
        "correct": 0,
        "total": 0,
        "errors": []
    }
    for example in test_set:
        prediction = model.generate(example["input"])
        ground_truth = example["output"]
        # Task-specific evaluation
        is_correct = evaluate_answer(prediction, ground_truth)
        if is_correct:
            results["correct"] += 1
        else:
            results["errors"].append({
                "input": example["input"],
                "predicted": prediction,
                "expected": ground_truth
            })
        results["total"] += 1
    results["accuracy"] = results["correct"] / results["total"]
    return results

# Run evaluation
eval_results = evaluate_model(fine_tuned_model, test_data)
print(f"Accuracy: {eval_results['accuracy']:.2%}")

# Analyze errors
for error in eval_results["errors"][:5]:  # First 5 errors
    print(f"\nInput: {error['input']}")
    print(f"Predicted: {error['predicted']}")
    print(f"Expected: {error['expected']}")
Qualitative Evaluation:
# Compare base vs fine-tuned side-by-side
test_prompts = [
"How do I reset my password?",
"What's included in Premium plan?",
"Can I export data to CSV?"
]
for prompt in test_prompts:
base_response = base_model.generate(prompt)
ft_response = finetuned_model.generate(prompt)
print(f"\n{'='*50}")
print(f"Prompt: {prompt}")
print(f"\nBase model: {base_response}")
print(f"\nFine-tuned: {ft_response}")
print(f"{'='*50}")
A/B Testing (Production):
# Route 50% to each model
import random
def get_response(user_query):
    if random.random() < 0.5:
        model = "base"
        response = base_model.generate(user_query)
    else:
        model = "fine-tuned"
        response = finetuned_model.generate(user_query)
    # Log for analysis
    log_interaction(user_query, response, model)
    return response
# After 1 week, analyze
analyze_ab_test_results()
Phase 4: Deployment (Week 4)
Deployment options:
1. API endpoint:
from fastapi import FastAPI
from peft import PeftModel
from transformers import AutoModelForCausalLM

app = FastAPI()

# Load model once at startup
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B")
model = PeftModel.from_pretrained(base_model, "./my-adapter")

@app.post("/generate")
async def generate(prompt: str):
    # (tokenization/decoding omitted for brevity)
    response = model.generate(prompt)
    return {"response": response}
2. Replace OpenAI calls:
# Before
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}]
)

# After
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
    messages=[{"role": "user", "content": prompt}]
)
3. Gradual rollout:
def get_model_for_user(user_id):
    # Gradual rollout: 0% → 10% → 50% → 100%
    rollout_percentage = get_rollout_percentage()
    if hash(user_id) % 100 < rollout_percentage:
        return fine_tuned_model
    else:
        return base_model
Costs & ROI Analysis 💰
Training Costs
OpenAI API:
GPT-3.5-turbo:
Cost: $8 per 1M training tokens
Example: 1,000 examples × 500 tokens = 500K tokens
Total: $4
GPT-4o-mini:
Cost: $25 per 1M training tokens
Same example: $12.50
Hosting: $0 (served by OpenAI)
Self-hosted (LoRA):
LLaMA 3.1 70B:
GPU: 1x A100 80GB
Time: 3 hours
Cost: $10/hour × 3 = $30
Storage: $0.10/GB/month
Model: 140GB × $0.10 = $14/month
Inference: $0.50-2/hour GPU time
Self-hosted (QLoRA):
LLaMA 3.1 70B:
GPU: 1x RTX 4090 (your own)
Time: 8 hours
Cost: $0 (electricity ~$2)
Or cloud RTX 4090:
Cost: $0.50/hour × 8 = $4
Inference Costs
Cost per 1M tokens:
Model | Input | Output | Total (50/50)
-------------------------|--------|--------|---------------
GPT-4o | $2.50 | $10.00 | $6.25
GPT-3.5-turbo | $0.50 | $1.50 | $1.00
Fine-tuned GPT-3.5 | $0.50 | $1.50 | $1.00
Claude Sonnet 3.5 | $3.00 | $15.00 | $9.00
Self-hosted LLaMA 3.1 70B| $0.80 | $0.80 | $0.80
ROI Examples
Example 1: Customer Support
Scenario: 100K queries/month
Before (human agents):
Cost: 10 agents × $4K/month = $40K/month
After (fine-tuned GPT-3.5):
API cost: 100K queries × 1K tokens × $1/M = $100/month
1 agent for escalations: $4K/month
Total: $4,100/month
Savings: $35,900/month
ROI: $35,900 saved on $4,100 total spend ≈ 8.8x return each month
Fine-tuning investment: $50
Break-even: Instant
Example 2: Legal Contract Review
Scenario: 500 contracts/month
Before (paralegals):
Time: 500 contracts × 2 hours = 1,000 hours
Cost: 1,000 hours × $50/hour = $50K/month
After (fine-tuned GPT-4o-mini):
API cost: 500 × 10K tokens × $1.50/M = $7.50
Review time: 500 × 0.25 hours = 125 hours
Cost: 125 × $50 = $6,250
Total: $6,257.50/month
Savings: $43,742.50/month
ROI: $43,742 saved on $6,258 total spend ≈ 7x return each month
Fine-tuning investment: $150
Break-even: Day 1
Example 3: Code Generation
Scenario: Internal tool, 10 developers
Before (manual coding):
Time saved: 5 hours/week/dev = 50 hours/week
Value: 50 hours × $100/hour × 4 weeks = $20K/month
After (self-hosted LLaMA-Coder):
Training cost (one-time): $200
Hosting: $500/month (GPU server)
Net savings: $20K - $500 = $19,500/month
ROI: 39x
Break-even: ~1 day ($700 first-month cost vs ~$650/day saved)
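The arithmetic behind these ROI examples can be wrapped in one small helper. The function is illustrative (not from the article); the numbers below reproduce Example 1 (customer support):

```python
def monthly_roi(cost_before, cost_after, one_time_investment):
    """Monthly savings and months until the one-time investment is recovered."""
    savings = cost_before - cost_after
    months_to_break_even = (one_time_investment / savings
                            if savings > 0 else float("inf"))
    return savings, months_to_break_even

savings, break_even = monthly_roi(
    cost_before=40_000,       # 10 agents × $4K/month
    cost_after=4_100,         # $100 API + 1 escalation agent
    one_time_investment=50,   # the fine-tuning run itself
)
```

With $35,900 saved per month against a $50 investment, break-even is a tiny fraction of a month - effectively instant, as the example says.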
Common Pitfalls & Solutions ⚠️
Pitfall 1: Overfitting
Symptom:
Training accuracy: 99%
Test accuracy: 65%
Causes:
- Too few examples
- Too many epochs
- Memorizing instead of generalizing
Solutions:
# 1. Reduce epochs
num_epochs = 1 # Start with 1, increase if needed
# 2. Increase dropout
lora_dropout = 0.1 # Regularization
# 3. Early stopping
early_stopping_patience = 3
# 4. Data augmentation
def augment_data(example):
    # Paraphrase questions and add wording variations
    variations = generate_paraphrases(example)  # placeholder helper
    return variations
Pitfall 2: Catastrophic Forgetting
Symptom:
Model becomes expert at new task
But forgets how to do basic things
Solutions:
# Use LoRA instead of full fine-tuning
# Keeps base model frozen
# Or: Mix in general examples
training_data = domain_specific_data + general_data
Pitfall 3: Poor Data Quality
Symptom:
Model learns bad patterns
Inconsistent outputs
Solutions:
# 1. Manual review sample
review_random_sample(data, n=100)
# 2. Automated checks
def check_quality(example):
checks = {
"too_short": len(example["output"]) < 20,
"repetitive": has_repetition(example["output"]),
"formatting": not well_formatted(example["output"])
}
return checks
# 3. Remove low-quality examples
high_quality_data = [ex for ex in data if passes_quality_checks(ex)]
Pitfall 4: Wrong Baseline
Symptom:
Fine-tuned model only 2% better
(But you spent 2 weeks fine-tuning)
Solution:
ALWAYS establish baseline first:
1. Try prompt engineering
2. Try few-shot learning
3. Try RAG
4. THEN consider fine-tuning
Only fine-tune if baseline < 80% and you need >90%
Quick Reference Checklist ✅
BEFORE FINE-TUNING:
☐ Tried prompt engineering?
☐ Tried RAG?
☐ Baseline performance < 80%?
☐ Have 500+ quality examples?
☐ Clear success metrics defined?
☐ Budget allocated?
CHOOSING METHOD:
☐ Unlimited budget? → Full fine-tuning
☐ Normal budget? → LoRA
☐ Low budget? → QLoRA
☐ API user? → OpenAI/Anthropic fine-tuning
DATA PREPARATION:
☐ Format validated?
☐ Train/val/test split done?
☐ Quality checked?
☐ Diverse examples?
☐ Edge cases included?
TRAINING:
☐ Start with 1 epoch
☐ Monitor validation loss
☐ Save checkpoints
☐ Log everything
EVALUATION:
☐ Test on HELD-OUT data
☐ Compare to baseline
☐ Qualitative review
☐ Error analysis
☐ A/B test in production
DEPLOYMENT:
☐ Gradual rollout
☐ Monitoring in place
☐ Rollback plan ready
☐ Cost tracking enabled
Conclusion 🎯
Fine-tuning is powerful but not always necessary:
When to fine-tune:
- Specialized domain with <80% baseline performance
- Need consistent format/style
- Have 500+ quality examples
- ROI justifies effort
When NOT to fine-tune:
- Prompt engineering can solve it
- RAG is better for knowledge
- <100 examples
- Task changes frequently
- Base model already excellent
Best practices:
- Start simple (prompt engineering)
- Use LoRA unless you have specific reason not to
- Quality > quantity for data
- Always measure against baseline
- Monitor in production
Next steps:
- Read Part 1: Understanding Tokens
- Read Part 2: LLM Architectures
- Try fine-tuning with small dataset first
- Scale up based on results