Hey folks 👋,
Today we're diving into something pretty wild: fine-tuning any model in just minutes using Unsloth.
Yeah, you heard that right. Most people make fine-tuning sound like a week-long nightmare of GPU configs, endless training loops, and mysterious error logs... but Unsloth flips the script and makes it stupidly easy. (If the model is too large, it may take some time ⏳ so start small first! 🚀)
Before we roll up our sleeves, let's first get on the same page about:
- What fine-tuning actually is 🤔
- Why you'd even bother doing it
- And finally, a quick Fine-tuning vs. RAG comparison to see where each shines.
By the end of this post, you'll realize fine-tuning isn't some scary AI wizardry anymore; it's just another tool you can wield in your workflow... and Unsloth might just be your new favorite weapon.
What is Fine-Tuning?
Okay, let's strip away the jargon for a second.
Think of a big language model like a super-talented intern. They've read every book, article, and Reddit thread out there, but they still don't really know your company's tone, your niche topic, or the weird acronyms you use every day.
Fine-tuning is basically giving that intern a crash course tailored to your needs.
You feed them examples, show them how you want things done, and after some focused training, they start producing answers like they've been working with you for years.
In AI terms:
- You take a pre-trained model (already knows a lot about the world).
- You retrain it on your own data (so it learns your specific style, knowledge, or use case).
- End result: a model that feels custom-made for you.
Without fine-tuning, you can still ask the model questions, but it's like hiring that intern without ever showing them how you operate. With fine-tuning? They're now part of the family.
Why We Need Fine-Tuning
Here's the thing: out-of-the-box AI models are like Swiss Army knives.
They can do everything, but they're not razor-sharp at the one thing you really care about.
If you:
- Have a specialized domain (medical, legal, crypto, anime lore… whatever your jam is).
- Want consistent tone & style (no more AI mood swings between Shakespeare and Twitter slang).
- Need faster, more accurate answers without over-explaining or hallucinating.
…then fine-tuning is your secret weapon.
Sure, you could just keep prompting the base model with long instructions every single time, but that's like telling your barista your entire coffee order every single morning. Fine-tuning is like getting them to remember it and have it ready before you even walk in.
Bottom line: it saves time, improves accuracy, and makes the AI yours.
Fine-Tuning vs RAG - The Showdown
Okay, so you've got two big guns in the AI toolbox: Fine-Tuning and RAG (Retrieval-Augmented Generation).
Both can make your model smarter, but they work very differently.
Fine-Tuning 🧠
You take the base model, teach it your data, and it remembers it forever.
Pros: Lightning-fast responses, no extra database calls, perfect for style/tone training.
Cons: You need to retrain when your data changes a lot, and each fine-tuned copy takes up storage.
It's like teaching your friend all your inside jokes; they just get you without explanation.
RAG 📚
Instead of retraining, you keep your data in a separate database, and the model fetches the right info every time you ask.
Pros: Always up-to-date, great for large & frequently changing data.
Cons: Slower than fine-tuning, needs a solid search system, tone/style training is harder.
It's like your friend always Googling stuff before answering you: accurate, but it takes a second.
When to use which?
- Fine-Tuning: If you want personality, speed, and style consistency.
- RAG: If your data changes daily or is huge.
- Both together: Now we're talking god-tier AI.
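To make the contrast concrete, here's a toy sketch of the RAG side in plain Python. Everything in it is invented for illustration (real systems use embeddings and a vector store); the point is simply that the knowledge lives outside the model, whereas fine-tuning bakes it into the weights.
# Toy sketch of RAG: the knowledge stays OUTSIDE the model and gets fetched at
# question time. Documents, scoring, and prompt format are all made up here;
# a real setup would use an embedding model and a vector database.
import re

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am-6pm IST, Monday through Friday.",
    "The Pro plan includes priority support and unlimited projects.",
]

def retrieve(question, documents):
    # Crude relevance score: how many words the question shares with each doc.
    q_words = set(re.findall(r"\w+", question.lower()))
    return max(documents, key=lambda d: len(q_words & set(re.findall(r"\w+", d.lower()))))

question = "When can I get a refund?"
context = retrieve(question, docs)

# The base model never learned this fact; it just reads it in the prompt.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)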
Types of Fine-Tuning
Fine-tuning isn't a one-size-fits-all game. You can decide how deep you want to tweak the model's brain.
1️⃣ Full Fine-Tuning 🏋️‍♂️
You retrain all the model's weights.
Pros: Maximum control, can teach completely new behaviors.
Cons: Heavy on compute & time.
Think of it like giving the model a full brain transplant.
2️⃣ Partial / Layer-Specific Fine-Tuning 🎯
Instead of touching the whole network, you only tweak specific layers, often the last few.
Pros: Faster, cheaper, less risk of "forgetting" old knowledge.
Cons: Changes are limited; you can't totally change its nature.
Like reprogramming just the "decision-making" part of the brain while keeping the senses the same.
3️⃣ LoRA / Parameter-Efficient Fine-Tuning 🪶
LoRA (Low-Rank Adaptation) adds small trainable adapters to the model without changing most of the original weights.
Pros: Tiny storage size, super-fast, can stack multiple LoRAs.
Cons: Slightly less flexible than full fine-tuning.
Think of it like giving the model extra plug-in skills without rewriting its core brain.
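To see why LoRA is so light, here's a tiny torch sketch of the arithmetic behind a single adapter. The matrix size and rank are made-up numbers for illustration, not any real model's dimensions.
# Rough intuition for LoRA on one made-up 1024x1024 weight matrix W.
# Instead of updating W, LoRA trains two skinny matrices A and B and adds
# their product: W_effective = W + B @ A. B starts at zero, so training
# begins from the unchanged base model.
import torch

d, r = 1024, 8                       # hidden size and LoRA rank (r << d)
W = torch.randn(d, d)                # frozen pre-trained weight
A = torch.randn(r, d) * 0.01         # trainable, r x d
B = torch.zeros(d, r)                # trainable, d x r

full_params = W.numel()              # parameters full fine-tuning would touch
lora_params = A.numel() + B.numel()  # parameters LoRA actually trains
print(f"LoRA trains {lora_params:,} of {full_params:,} weights "
      f"({100 * lora_params / full_params:.1f}%) for this layer")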
4️⃣ Prompt-Tuning / Prefix-Tuning 📝
You don't even touch the main model weights; you just train special "prompt vectors" that guide its behavior.
Pros: Ultra-lightweight, instant swapping between tasks.
Cons: Works best for style & specific formats, not huge domain shifts.
Like giving your friend a script before they speak.
Okayy… enough theory! 🥱 I know your brain's already halfway to coffee mode, so let's just skip the fluff and jump straight into the good stuff.
Time to roll up our sleeves and fine-tune like a pro, but here's the plot twist: it's going to be so ridiculously easy with Unsloth that you might start questioning everything you thought you knew about AI workflows.
Let's dive in. 🚀
First things first: let's set up our environment.
No 20-step "install this, patch that" nonsense, just a clean, minimal setup that gets you from zero to fine-tuning in minutes.
We'll:
- Install Unsloth.
- Load a pre-trained LLaMA model.
- Feed it our dataset.
- Hit fine-tune… and boom 💥 you're done before your coffee cools.
Here's how it starts:
!pip install "unsloth[colab-new]==2025.8.1"
!pip install --no-deps xformers triton
I've tried this exact version (2025.8.1) and it works fine for me ✅. If, for some reason, it doesn't work on your setup (maybe due to Colab updates or dependency changes), just drop the pinned version number and install the latest one instead. That way, you'll always get the newest compatible release.
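Once the install finishes, an optional one-liner confirms which version actually landed in your environment. This uses the standard library's package metadata, nothing Unsloth-specific.
# Optional sanity check: confirm the installed Unsloth version.
from importlib.metadata import version
print("unsloth:", version("unsloth"))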
Data: Your Model's Training Fuel
I used some random sample data just to show how things work, but you can totally use your own customized data. Just make sure to follow the same format so Unsloth knows how to read it.
Example format:
dataset = [
{
"instruction": "Explain the process of photosynthesis in simple terms.",
"input": "What is photosynthesis?",
"output": "Photosynthesis is the process plants use to convert light energy into chemical energy. They take in carbon dioxide, water, and sunlight, and use them to create glucose (their food) and release oxygen as a byproduct."
},
{
"instruction": "What is the significance of the Magna Carta?",
"input": "Why is the Magna Carta important?",
"output": "The Magna Carta, signed in 1215, is a foundational document in the history of democracy. It established the principle that everyone, including the king, was subject to the law, and it guaranteed individuals' rights to justice and a fair trial."
},
{
"instruction": "Describe the function of a CPU in a computer.",
"input": "What does a CPU do?",
"output": "The CPU, or Central Processing Unit, is the primary component of a computer that acts as its 'brain.' It performs most of the processing inside a computer, executing instructions and carrying out calculations for all other components."
},
{
"instruction": "What is the difference between nuclear fission and fusion?",
"input": "Explain fission vs. fusion.",
"output": "Nuclear fission is the splitting of a large, unstable atomic nucleus into smaller nuclei, which releases energy. Nuclear fusion is the process where two light atomic nuclei combine to form a heavier nucleus, releasing even more energy. Fission is used in power plants today, while fusion powers the sun."
},
{
"instruction": "Explain the concept of inflation.",
"input": "What is inflation?",
"output": "Inflation is the rate at which the general level of prices for goods and services is rising, and subsequently, purchasing power of currency is falling. For example, if inflation is 2%, a $1 coffee will cost $1.02 next year."
},
# Add more data as you want
{
"instruction": "If you have a bowl with six apples and you take away four, how many do you have?",
"input": "Apple riddle.",
"output": "You have four apples. You are the one who took them."
},
{
"instruction": "What is the flaw in this argument: 'All birds can fly. A penguin is a bird. Therefore, a penguin can fly.'",
"input": "Find the flaw in the logic.",
"output": "The flaw is in the first premise, 'All birds can fly.' This is an incorrect generalization, as some birds, like penguins and ostriches, cannot fly."
},
{
"instruction": "Mary's father has five daughters: Nana, Nene, Nini, Nono. What is the name of the fifth daughter?",
"input": "Daughter name riddle.",
"output": "The fifth daughter's name is Mary."
}
]
Keep it clean, relevant, and consistent; the model will thank you later.
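And if you do swap in your own data, a quick sanity check like the one below (a little helper I'm adding, not something Unsloth requires) catches missing keys before they turn into a confusing stack trace mid-training.
# Sanity-check the dataset: every example needs these three keys with non-empty strings.
REQUIRED_KEYS = {"instruction", "input", "output"}

for i, example in enumerate(dataset):
    missing = REQUIRED_KEYS - example.keys()
    assert not missing, f"Example {i} is missing keys: {missing}"
    for key in REQUIRED_KEYS:
        value = example[key]
        assert isinstance(value, str) and value.strip(), f"Example {i} has an empty '{key}'"

print(f"All {len(dataset)} examples look good.")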
The Fine-Tuning Script (a.k.a. Where the Magic Happens)
Alright, here's the full fine-tuning setup I used. It's built to be clean, minimal, and easy to follow: no endless config files, no arcane terminal spells.
Here's what's going on:
- Setup & Check GPU - Import dependencies, check your PyTorch version, and see if CUDA is available.
- Prepare the Data - Save your dataset in .jsonl format. I used a small sample dataset here, but you can replace it with your own (just follow the same instruction, input, output format).
- Quantization (4-bit) - This makes the model lighter, faster, and cheaper to train without losing much accuracy.
- Load Model - I'm using TinyLlama-1.1B-Chat for speed.
- LoRA Configuration - Instead of fine-tuning the whole model, we only train a small set of adapter layers, which is way faster and more memory-friendly.
- Tokenization - Convert our text into token IDs so the model can understand it.
- Training Arguments - Batch size, learning rate, epochs, logging, etc. are all set here.
- Train & Done - Press run, watch the magic, and your fine-tuned model will be saved in the finetuned_model_working folder.
💡 Pro tip: If the model size is too big for your GPU, start with something tiny like TinyLlama or phi-2. You'll still get the hang of fine-tuning without melting your Colab session 🔥.
import os
import json
import torch
from unsloth import FastLanguageModel
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling, BitsAndBytesConfig
from datasets import load_dataset
print(f"Torch Version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
# Create a directory and save the dataset to a JSONL file
os.makedirs("data", exist_ok=True)
with open("data/sample.jsonl", "w") as f:
    for ex in dataset:
        f.write(json.dumps(ex) + "\n")
print("Sample dataset saved.")
# Define the quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
# Load the model and tokenizer
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
    quantization_config=quantization_config,
)
# Configure the model for LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)
# Load the dataset from the JSONL file
jsonl_dataset = load_dataset("json", data_files="data/sample.jsonl", split="train")
# Define the tokenization function
def tokenize_fn(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    tokenized = tokenizer(
        prompt,
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
# Apply the tokenization function to the dataset
tokenized_dataset = jsonl_dataset.map(tokenize_fn, num_proc=4)
# Define the training arguments
training_args = TrainingArguments(
    output_dir="finetuned_model_working",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=5,
    num_train_epochs=10,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    optim="adamw_8bit",
    lr_scheduler_type="cosine",
    report_to="none",
)
# Define the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
# Start the training process
print("Starting training...")
trainer.train()
print("Training complete!!!")
If you want, read this code explanation for better clarity or skip it and just run the whole thing. It'll work either way. 🚀
- Import the tools
import os, json, torch
from unsloth import FastLanguageModel
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling, BitsAndBytesConfig
from datasets import load_dataset
We're bringing in all the libraries we need: unsloth for our magic fine-tuning, Hugging Face tools for training, and torch to check the GPU.
- Check environment
print(f"Torch Version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
Just a quick "are we ready?" check to see PyTorch version and GPU availability.
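If you want a bit more detail than a True/False (which GPU Colab gave you matters for how big a model you can fit), this optional check prints the name and VRAM:
# Optional: print the GPU name and VRAM so you know what you're working with.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name} | VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU found - training on CPU will be painfully slow.")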
- Save dataset
os.makedirs("data", exist_ok=True)
with open("data/sample.jsonl", "w") as f:
    for ex in dataset:
        f.write(json.dumps(ex) + "\n")
print("Sample dataset saved.")
We make a folder called data and save our dataset as sample.jsonl, the format our load_dataset call expects.
- Quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
Loads the model in 4-bit mode to save VRAM, run faster, and still keep accuracy decent.
- Load the base model
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
    quantization_config=quantization_config,
)
We pull in TinyLlama: small, fast, and perfect for demos.
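If you're curious whether the 4-bit loading actually paid off, you can peek at the loaded model's footprint. get_memory_footprint() is a stock transformers method (nothing Unsloth-specific), and the exact number will vary on your setup.
# Optional: rough memory footprint of the quantized model.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")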
- Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)
Instead of retraining the entire model, we fine-tune just a few adapter layers (LoRA). This is way faster and needs less memory.
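You can verify that claim on your own run by counting parameters right after the LoRA wrapping; the trainable fraction should come out tiny.
# How many parameters does LoRA actually train?
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")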
- Load dataset for training
jsonl_dataset = load_dataset("json", data_files="data/sample.jsonl", split="train")
Reads our .jsonl file so we can feed it into the model.
- Tokenize the text
def tokenize_fn(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    tokenized = tokenizer(
        prompt,
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized
tokenized_dataset = jsonl_dataset.map(tokenize_fn, num_proc=4)
Turns human-readable text into token IDs so the model can understand it. We also duplicate input_ids as labels for training.
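Before training, it's worth decoding one example back to text to confirm the template came out the way you intended. This is just a sanity check I like to add, not part of the original script.
# Eyeball one tokenized example to confirm the prompt template looks right.
sample = tokenized_dataset[0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True)[:300])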
- Training settings
training_args = TrainingArguments(
    output_dir="finetuned_model_working",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=5,
    num_train_epochs=10,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    optim="adamw_8bit",
    lr_scheduler_type="cosine",
    report_to="none",
)
Here's where we set batch size, learning rate, number of epochs, optimizer, and mixed precision (FP16/BF16).
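One bit of mental math that helps when adjusting these numbers: the effective batch size is per_device_train_batch_size × gradient_accumulation_steps, and the total number of optimizer steps follows from that and your dataset size. A back-of-the-envelope check for the settings above, assuming the 8-example sample dataset:
# Back-of-the-envelope training math for the settings above.
per_device_batch = 2
grad_accum = 4
epochs = 10
num_examples = len(tokenized_dataset)                    # 8 with the sample data

effective_batch = per_device_batch * grad_accum          # 8 examples per optimizer step
steps_per_epoch = max(1, num_examples // effective_batch)
print(f"Effective batch size: {effective_batch}")
print(f"~{steps_per_epoch * epochs} optimizer steps over {epochs} epochs")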
- Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
Batches up data for the model; mlm=False because we're doing causal (chat-style) training, not masked language modeling.
- Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
This is Hugging Face's high-level training wrapper; it handles batching, logging, checkpointing, and more.
- Train!
print("Starting training...")
trainer.train()
print("Training complete!!!")
Hit run, watch your GPU work, and in a few minutes you'll have a fine-tuned model ready in finetuned_model_working/.
After the training is complete, you'll end up with a fine-tuned model folder containing multiple files such as config.json, the model or adapter weights (for example pytorch_model.bin or adapter_model.safetensors), tokenizer.json, etc.
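One caveat from my side: depending on your save settings, the Trainer may only write intermediate checkpoints into that folder during training. To be certain the final adapter and tokenizer end up there, I'd add an explicit save right after trainer.train():
# Explicitly save the final adapter weights and tokenizer after training.
trainer.save_model("finetuned_model_working")
tokenizer.save_pretrained("finetuned_model_working")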
To make this model accessible publicly or use it anywhere, you'll need to upload it to Hugging Face.
Here's how you do it:
1. Login to Hugging Face
- Go to huggingface.co and sign in to your account.
2. Create a New Model Repository
- On your profile sidebar, click "New Model".
- Give your model a unique name (for example, FineTuned_TinyLLM).
- Choose whether it should be public or private.
3. Upload the Files
Option 1: Manual Upload
- Click "Add file" → "Upload files".
- Select all the files from your fine-tuned folder and upload them.
Option 2: Command Line Upload (requires huggingface_hub library)
huggingface-cli login
git clone https://huggingface.co/username/FineTuned_TinyLLM
cd FineTuned_TinyLLM
cp -r /path/to/your/fine_tuned_model/* .
git add .
git commit -m "Upload fine-tuned model"
git push
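A quick heads-up on the git route: model weights are large binaries, so you'll usually need Git LFS installed (git lfs install) before the push goes through. If you'd rather stay in Python, a sketch using huggingface_hub does the same job; swap in your own repo name.
# Upload the fine-tuned folder straight from Python (assumes you've run huggingface-cli login).
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("username/FineTuned_TinyLLM", exist_ok=True)  # no-op if it already exists
api.upload_folder(
    folder_path="finetuned_model_working",
    repo_id="username/FineTuned_TinyLLM",
    commit_message="Upload fine-tuned model",
)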
4. Wait for Hugging Face to Process
- Within a few minutes, your model will be live and ready to use.
The "Magic Bunny" Moment
Now you can load your fine-tuned model from anywhere using just its Hugging Face model name, as shown in the code below.
Code to Use the Uploaded Model
import torch
from unsloth import FastLanguageModel
model_name = "username/FineTuned_TinyLLM" # Your Hugging Face model name
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
)
def generate_response(instruction, input_text):
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    response_start = response.find("### Response:\n") + len("### Response:\n")
    extracted_response = response[response_start:].strip()
    return extracted_response
instruction = "What is the significance of the Magna Carta?"
input_text = "Why is the Magna Carta important?"
response = generate_response(instruction, input_text)
print("Generated Response:")
print(response)
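One optional tweak: Unsloth's example notebooks call FastLanguageModel.for_inference(model) right after loading to enable their faster inference path. If your Unsloth version has it, it's a one-line addition; the code above works fine without it, just a bit slower.
# Optional (from Unsloth's example notebooks): enable faster inference mode.
FastLanguageModel.for_inference(model)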
Conclusion
That's it, folks: fine-tuning is actually much simpler than it looks. Once your model is trained, uploading it to Hugging Face and running it in your own projects is just a matter of following the steps. With this approach, you can take even a small base model and make it incredibly smart in your own niche, without needing supercomputers or million-dollar budgets.
The beauty of this workflow is that it opens the door to personalized AI: whether you're building a chatbot, a knowledge assistant, or a specialized text generator, fine-tuning gives you complete creative control. Once you see your fine-tuned model generate results in your style, you'll realize that the "magic bunny" really does jump out of the hat. 🪄🐇
Credit: This method of using Unsloth for fine-tuning is inspired by AI with Thiru. He's the one who introduced me to this approach, and I highly recommend checking out his video for better clarity and deeper insights; he explains it in a way that clicks instantly.
🔗 Connect with Me
📖 Blog by Naresh B. A.
👨‍💻 Aspiring Full Stack Developer | Passionate about Machine Learning and AI Innovation
🌐 Portfolio: [Naresh B A]
📫 Let's connect on [LinkedIn] | GitHub: [Naresh B A]
💡 Thanks for reading! If you found this helpful, drop a like or share a comment feedback keeps the learning alive.