Out of the box, LLMs are very good at generating responses to users' queries, but these are general responses drawn from the general training data fed to them.
But with fine-tuning, you teach LLMs to speak your language, follow your rules, and deliver answers that are custom-built for your needs. You get to tailor LLM responses to your tone, jargon, or exact way of doing things.
💡 What is Fine-Tuning?
Fine-tuning means taking a base model (GPT-4 in this case), and training it further with your own examples so it learns to follow your patterns, formats, or tone.
The difference from prompting:
Prompting: You tell the model exactly what you want every single time.
Fine-tuning: You teach the model what you want once, and it remembers that style or behavior forever (until you retrain it).
When to fine-tune:
- You want a consistent tone or voice in all responses.
- You want responses tailored to a domain with special jargon (legal, medical, fintech, etc.).
- You need highly structured outputs every time.
- You want shorter prompts and faster responses for repeated tasks.
When NOT to fine-tune:
- Your information changes frequently (product prices, live news).
- You only need small, one-off adjustments.
- You want to add knowledge or facts. For that, use a Retrieval-Augmented Generation (RAG) setup instead.
📌 How Fine-Tuning Works
Fine-tuning is like taking a model that has already read the whole internet (that's the pre-training stage) and then giving it extra, specialized lessons so it responds the way you want.
Here’s the flow:
- Pre-training data - Massive amounts of general text (books, websites, articles) used to train an LLM.
- This first training run produces a Base LLM (like GPT-4): a generalist that knows everything in theory.
- Fine-tuning data - Carefully prepared examples that teach the base model your tone, format, and special rules.
- This second training run produces a Fine-tuned LLM: in this case, the same GPT-4 brain, but with your custom behavior layered on top.
- You can send Prompts to the fine-tuned model, and it gives an Output that matches your style without needing long instructions.
📌 Performance versus Investments
Before you consider fine-tuning, there are other methods for using LLMs. Each method has its own advantages and tradeoffs, and you decide based on your needs and goals.
Here's the Performance vs Investment chart for some methods of using LLMs:
- Prompting requires the lowest investment but has low performance. It yields very generic responses.
- One-shot/Few-shot Prompting is slightly better and still requires low investment. Giving examples can improve the responses, but it still relies on you to provide the right examples each time.
- Fine-tuning requires a much higher investment but delivers a big performance jump, because the model is trained specifically on your preferences.
- Pre-training requires the most investment. It builds the base model, which can then be fine-tuned.
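To make the first two tiers concrete, here is a minimal sketch (the prompts and the gpt-4 model name are illustrative assumptions, not part of a specific project) showing how few-shot prompting packs example interactions into every single request, which is exactly what fine-tuning later bakes into the model itself:

```python
from openai import OpenAI

openai = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Zero-shot prompting: the model gets only the instruction.
zero_shot = [
    {"role": "user", "content": "Suggest a 3-day itinerary for Tokyo."},
]

# Few-shot prompting: the same instruction, preceded by an example
# exchange that demonstrates the style we want back.
few_shot = [
    {"role": "user", "content": "Suggest a 3-day itinerary for Kyoto."},
    {"role": "assistant", "content": "Day 1: Fushimi Inari at dawn...\nDay 2: Arashiyama...\nDay 3: Gion..."},
    {"role": "user", "content": "Suggest a 3-day itinerary for Tokyo."},
]

response = openai.chat.completions.create(model="gpt-4", messages=few_shot)
print(response.choices[0].message.content)
```

Notice the cost: every few-shot request re-sends the examples, whereas a fine-tuned model has already internalized them.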
📌 Step-by-Step GPT-4 Fine-Tuning Process
Data is the foundation of fine-tuning. Your dataset is the heart of your model. It's what gives it that unique knowledge and voice.
1. ✍️ Get Your OpenAI API Key and Setup OpenAI
I used Google Colab, which has a Secrets panel for storing the API key safely, so here is how I set this up:
```python
from google.colab import userdata
from openai import OpenAI

# Read the API key stored in Colab's Secrets panel
api_key = userdata.get('openai_api')

# Connect to the OpenAI API
openai = OpenAI(api_key=api_key)
```
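If you are not running in Colab, a common alternative (my own assumption, not part of the original setup) is to read the key from an environment variable instead:

```python
import os
from openai import OpenAI

# Assumes you exported OPENAI_API_KEY in your shell beforehand
openai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```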
2. ✍️ Prepare Your Dataset
To effectively train a model, you must provide examples of desired interactions and organize the dataset into these three key parts:
- System Prompt: This defines the guidelines for every response. It goes in the system role of the interaction cycle.
- User Prompt: This is what the user asks to trigger a response. It goes in the user role of the interaction cycle.
- Assistant Response: This is the response we expect the model to learn to generate from the system and user prompts. It takes the assistant role of the interaction cycle.
{"messages": [
{"role": "system", "content": "You are a helpful travel assistant."},
{"role": "user", "content": "Best time to visit Japan?"},
{"role": "assistant", "content": "The best time to visit Japan is spring (March–May) or autumn (September–November)."}
]}
These examples go into a JSONL file, where each line represents one full interaction cycle with an LLM interface (the example above is pretty-printed for readability; in the actual file each record sits on a single line). You need many of these interaction cycles as example data to train the Base LLM and produce a Fine-Tuned LLM.
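As a minimal sketch (the example content is made up), here is how you might write such records to a JSONL file in Python:

```python
import json

# Hypothetical examples; a real dataset needs many more, and more varied, entries
examples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful travel assistant."},
        {"role": "user", "content": "Best time to visit Japan?"},
        {"role": "assistant", "content": "The best time to visit Japan is spring (March-May) or autumn (September-November)."},
    ]},
]

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        # json.dumps without indentation keeps each record on one line, as JSONL requires
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```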
Tips for dataset quality:
- Use clear, consistent formatting.
- Avoid typos or mixed instructions.
- Include hundreds to thousands of diverse examples for better results.
3. ✍️ Split The Dataset For Training and Validation
You can generate the dataset using LLMs (known as synthetic data) or prepare it manually. Whichever way, you need to shuffle the examples and split them into two files:
- For training - The training dataset can be 80% of the data.
- For validation - The validation dataset can be 20% of the data.
Save each dataset as a .jsonl file: train_data.jsonl and validation_data.jsonl.
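Here is a minimal sketch of that shuffle-and-split step (the file names match the ones above; the 80/20 ratio is the rule of thumb from this section):

```python
import random

# Load all examples (one JSON record per line)
with open("dataset.jsonl", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Shuffle so the split is not biased by the original ordering
random.shuffle(lines)

split = int(len(lines) * 0.8)  # 80% training / 20% validation

with open("train_data.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines[:split])

with open("validation_data.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines[split:])
```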
4. ✍️ Upload The Datasets To OpenAI
Fine-tuning doesn't happen on your own machine. You upload your data because fine-tuning runs on OpenAI's servers. Once uploaded, your data stays private and under your control.
```python
def upload_file(filename: str, purpose: str) -> str:
    # Upload a file to OpenAI and return its file ID
    with open(filename, "rb") as file:
        response = openai.files.create(file=file, purpose=purpose)
    return response.id

train_file_id = upload_file("train_data.jsonl", "fine-tune")
validation_file_id = upload_file("validation_data.jsonl", "fine-tune")
```
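You can optionally confirm the uploads went through by fetching the file metadata back (openai.files.retrieve is part of the standard OpenAI Python client):

```python
# Optional sanity check: confirm both files are visible on OpenAI's side
print(openai.files.retrieve(train_file_id).filename)
print(openai.files.retrieve(validation_file_id).filename)
```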
5. ✍️ Create the Fine-Tune Job
Creating the fine-tune job is the "launch training" step: you connect your uploaded datasets to a base GPT-4 model and tell OpenAI to start customizing it for you.
This takes some time to complete. Check your OpenAI dashboard to confirm the status.
```python
MODEL = "<check_for_model_on_openai_dashboard>"

response = openai.fine_tuning.jobs.create(
    training_file=train_file_id,
    validation_file=validation_file_id,
    model=MODEL,
    suffix="travel-model"  # appended to the fine-tuned model's name
)
```
Retrieve the Fine-Tuned Model ID
When your fine-tuning job finishes, OpenAI returns a model ID that you will use when making API calls.
```python
tuned_model_id = openai.fine_tuning.jobs.retrieve(response.id).fine_tuned_model
```
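Note that fine_tuned_model is only populated once the job has finished. A simple polling loop (a sketch built on the same jobs.retrieve call) can wait for it:

```python
import time

# Poll until the job reaches a terminal state; fine_tuned_model stays None until then
while True:
    job = openai.fine_tuning.jobs.retrieve(response.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

tuned_model_id = job.fine_tuned_model
print(job.status, tuned_model_id)
```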
6. ✍️ Use Your Fine-Tuned Model
Once your fine-tuned model is ready and you have the fine-tuned model ID, using it is just like using any other OpenAI model. Just swap the model value in your API call.
```python
# Define the system prompt
system_prompt = "You are a helpful travel assistant."

# Define a user prompt
user_prompt = "Best time to visit France?"

# Define the messages
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

response = openai.chat.completions.create(
    model=tuned_model_id,
    messages=messages,
    temperature=1.1
)

# Print the assistant's response
print(response.choices[0].message.content)
```
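One design note: temperature=1.1 leans toward varied, creative wording. If your fine-tuned model must produce consistent, structured output, a lower value (say 0.2) is usually the safer choice; the evaluation section below returns to this tradeoff.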
📌 Evaluation of a Fine-Tuned Model
Why Evaluate?
Evaluation ensures your fine-tuned model meets your goals. Without evaluation, you risk deploying a model that’s inaccurate, inconsistent, or overfitted.
1. Evaluation Metrics
- Qualitative Metrics
  - Assess tone, style, clarity, and factual accuracy by reading outputs.
  - Check if responses align with your brand voice or application needs.
  - Identify edge cases where the model fails or strays from requirements.
- Quantitative Metrics
  - Training loss measures how well the model fits the training data.
  - Validation loss measures performance on unseen data to detect overfitting. Lower values, with training and validation loss staying close together, are generally a good sign.
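To inspect those loss numbers programmatically, you can pull the job's event stream (list_events is part of the standard OpenAI Python client; printing the event messages is just a quick way to see the reported metrics):

```python
# Fetch recent events from the fine-tuning job; training metrics appear here
events = openai.fine_tuning.jobs.list_events(fine_tuning_job_id=response.id, limit=20)
for event in events.data:
    print(event.message)
```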
2. Test Prompts for Qualitative Analysis
- Prepare a set of realistic user queries.
- Include typical usage prompts, edge cases, and out-of-domain prompts.
- Compare results with the base model and your desired behavior; a small comparison sketch follows below.
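Here is a minimal sketch of such a side-by-side comparison (the test prompts are illustrative assumptions, and gpt-4 stands in for whichever base model you fine-tuned):

```python
# Hypothetical test prompts: typical usage, an edge case, and an out-of-domain query
test_prompts = [
    "Best time to visit Japan?",
    "Is it safe to travel during typhoon season?",
    "Explain quantum entanglement.",  # out-of-domain on purpose
]

for prompt in test_prompts:
    messages = [
        {"role": "system", "content": "You are a helpful travel assistant."},
        {"role": "user", "content": prompt},
    ]
    # Run the same prompt against the base model and the fine-tuned model
    for model in ("gpt-4", tuned_model_id):
        response = openai.chat.completions.create(model=model, messages=messages)
        print(f"--- {model} | {prompt}\n{response.choices[0].message.content}\n")
```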
3. Iterative Improvements
If evaluation reveals weaknesses:
- Add more diverse or similar training data.
- Adjust system and user prompts to clarify intent.
- Tune temperature (lower for consistency, higher for creativity).
- Repeat fine-tuning with updated datasets and parameters.
4. Overcoming Overfitting and Poor Output Quality
- Keep datasets balanced and not overly repetitive.
- Include a validation set during fine-tuning.
- Watch for validation loss rising while training loss falls. That's a sign of overfitting.
- Mix in general-purpose prompts alongside domain-specific examples to preserve versatility.
Summary
The fine-tuning toolkit:
- Start with quality data
- Structure thoughtfully
- Evaluate and iterate
- Improve with feedback
- Know when you are ready
Here is a GitHub notebook where I trained GPT-4 to write LinkedIn posts in the style and tone I want. It provides the full practical workflow and shows how you can generate synthetic data as your dataset for training and validation.
Happy coding!!!