Berkan Sesen

Posted on • Originally published at sesen.ai

AI Experts Are Dead. Long Live the AI Experts.

Last month, my eight-year-old built a Flappy Bird clone from scratch. He can't really type yet. He certainly can't write Python. What he can do is talk to Claude while I whisper in his ear what to say next. Within an hour, he had a working game: a bird, pipes, a score counter, gravity. He's an "AI expert" now.

And honestly? So is your dentist, your cousin's teenager, and the recruiter who just messaged you on LinkedIn. The barrier to "using AI" has collapsed to the cost of typing a sentence in English. This is genuinely wonderful. Democratisation of powerful technology is how we got the internet, smartphones, and open-source software.

But there's an asymmetry hiding behind this accessibility: while using AI has never been cheaper, building AI has never been more expensive. Training GPT-4 cost over $100 million. Llama 3 required 24,000 GPUs running for months. The companies that can afford to train foundation models from scratch fit comfortably in a single conference room. We've democratised the interface and monopolised the engine.

So where does that leave the engineers, the domain experts, the people who actually know things about medicine, law, finance, or logistics? Somewhere in between. And that somewhere has a name: fine-tuning. For a few hundred to a few thousand dollars, you can take a foundation model and make it yours, trained on your data, speaking your domain's language, following your formatting rules. Not building the engine from scratch, but tuning it to your track.

By the end of this post, you'll fine-tune a model on Azure OpenAI, understand the LoRA algorithm that makes it computationally feasible, and know exactly where fine-tuning sits in the hierarchy from prompt engineering to pre-training.

Quick Win: Fine-Tune on Azure in 20 Lines

Let's start with the punchline. Here's everything you need to fine-tune a GPT model on Azure OpenAI. No Colab badge here (you'll need Azure credentials), but the code itself is almost disappointingly simple.

Prepare Your Training Data

Azure expects JSONL format: one JSON object per line, each containing a conversation. Here's what training data looks like for a medical coding assistant:

{"messages": [{"role": "system", "content": "You are a medical coding assistant. Map clinical descriptions to ICD-10 codes."}, {"role": "user", "content": "Patient presents with acute bronchitis"}, {"role": "assistant", "content": "J20.9 — Acute bronchitis, unspecified"}]}
{"messages": [{"role": "system", "content": "You are a medical coding assistant. Map clinical descriptions to ICD-10 codes."}, {"role": "user", "content": "Type 2 diabetes with diabetic chronic kidney disease"}, {"role": "assistant", "content": "E11.22 — Type 2 diabetes mellitus with diabetic chronic kidney disease"}]}
{"messages": [{"role": "system", "content": "You are a medical coding assistant. Map clinical descriptions to ICD-10 codes."}, {"role": "user", "content": "Essential hypertension"}, {"role": "assistant", "content": "I10 — Essential (primary) hypertension"}]}

Each line is a complete conversation with a system prompt, user input, and the desired assistant response.
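Before uploading, it is worth checking the file locally so a formatting error does not surface halfway through a paid training run. A minimal validator using only the standard library (the role checks and the 10-example floor mirror Azure's stated requirements; the helper name is our own):

```python
import json

def validate_jsonl(path: str, min_examples: int = 10) -> int:
    """Check that each line parses as a chat example with the required roles."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            example = json.loads(line)  # raises on malformed JSON
            roles = [m["role"] for m in example["messages"]]
            assert "user" in roles, f"line {lineno}: missing user message"
            assert roles[-1] == "assistant", f"line {lineno}: last message must be the target output"
            count += 1
    assert count >= min_examples, f"need at least {min_examples} examples, found {count}"
    return count
```

Run it once over `training_data.jsonl` before calling the upload endpoint; a `json.JSONDecodeError` with a line number is far cheaper to debug here than in Azure's validation step.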

Upload Data and Launch Fine-Tuning

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR_RESOURCE.openai.azure.com/",
    api_key="YOUR_API_KEY",
    api_version="2025-03-01-preview",
)

# Upload training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)

print(f"Job ID: {job.id}")
print(f"Status: {job.status}")

Check Status and Use Your Model

# Check progress
job = client.fine_tuning.jobs.retrieve(job.id)
print(f"Status: {job.status}")  # queued → running → succeeded

# Once succeeded, deploy and use your model
if job.status == "succeeded":
    response = client.chat.completions.create(
        model=job.fine_tuned_model,
        messages=[
            {"role": "system", "content": "You are a medical coding assistant."},
            {"role": "user", "content": "Chronic obstructive pulmonary disease with acute exacerbation"},
        ],
    )
    print(response.choices[0].message.content)
    # J44.1 — COPD with (acute) exacerbation

That's it. Azure handles the infrastructure, the training loop, the checkpointing, and the deployment. You provide the data; it returns a model that speaks your domain.

What Just Happened?

Behind those 20 lines, a lot happened. Let's unpack it.

The JSONL Format

Each training example is a conversation in the chat completions format you already know. The key fields:

  • role: "system": Sets the persona. Keep this consistent across examples.
  • role: "user": The input your model will receive in production.
  • role: "assistant": The exact output you want the model to learn.

You can optionally add a "weight": 0 field to any message to exclude it from the loss computation. This is useful when you want the model to see context but only learn from specific responses.
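For example, in a multi-turn training conversation you might want the model to see an earlier clarifying exchange without imitating it. A sketch of one such JSONL line (the clinical content here is illustrative, not real training data):

```python
import json

# A multi-turn example: the first assistant turn is context only
# ("weight": 0 excludes it from the loss); the model learns only
# from the final assistant message.
example = {
    "messages": [
        {"role": "system", "content": "You are a medical coding assistant."},
        {"role": "user", "content": "Patient presents with bronchitis"},
        {"role": "assistant", "content": "Is this acute or chronic?", "weight": 0},
        {"role": "user", "content": "Acute"},
        {"role": "assistant", "content": "J20.9 — Acute bronchitis, unspecified"},
    ]
}
print(json.dumps(example))  # one line of your JSONL file
```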

The Training Pipeline

When you call client.fine_tuning.jobs.create(), Azure kicks off a pipeline:

  1. Validation: Checks your JSONL for formatting errors, token limits, and minimum example counts (at least 10 examples required).
  2. Queuing: Your job waits for GPU capacity.
  3. Training: The model is fine-tuned using LoRA (more on this shortly). Azure automatically creates checkpoints.
  4. Results: A results.csv file is generated with training and validation loss at each step.
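In practice you poll until the job leaves this pipeline. A small helper sketch (the `fetch` callable is a stand-in for `client.fine_tuning.jobs.retrieve`, and the poll interval is an arbitrary choice):

```python
import time

def wait_for_job(fetch, job_id: str, poll_seconds: float = 30.0):
    """Poll a fine-tuning job until it reaches a terminal status."""
    terminal = {"succeeded", "failed", "cancelled"}
    while True:
        job = fetch(job_id)  # e.g. client.fine_tuning.jobs.retrieve
        if job.status in terminal:
            return job
        time.sleep(poll_seconds)
```

Calling `wait_for_job(client.fine_tuning.jobs.retrieve, job.id)` blocks until training finishes, which is usually more convenient than re-running the status snippet by hand.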

Hyperparameters

You can customise the training with the hyperparameters argument:

| Parameter | Default | What It Controls |
| --- | --- | --- |
| n_epochs | Auto (based on dataset size) | Number of passes through the training data |
| learning_rate_multiplier | Auto | Scales the base learning rate. Higher means faster but riskier. |
| batch_size | Auto | Examples per gradient update. Larger is more stable but uses more memory. |
| seed | None | For reproducibility |

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 1.8,
        "batch_size": 4,
    },
)

For most use cases, the defaults work well. Azure auto-selects values based on your dataset size. If you want to systematically search for optimal values, hyperparameter optimisation methods like Bayesian optimisation can help.

Pricing

Fine-tuning costs vary by model tier:

| Tier | Training Cost | Hosting Cost | Best For |
| --- | --- | --- | --- |
| Standard | Higher per-token | Dedicated deployment | Production workloads |
| Global Standard | Moderate | Pay-per-use | Cost-effective production |

Training costs are measured in tokens processed. A dataset of 1,000 examples at ~200 tokens each, trained for 3 epochs, processes about 600K tokens, which typically costs a few dollars for smaller models like gpt-4o-mini.
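That arithmetic is easy to sketch (the per-million-token rate below is an illustrative placeholder, not a quoted Azure price; always check the current pricing page):

```python
def training_tokens(n_examples: int, avg_tokens: int, epochs: int) -> int:
    """Total billable tokens processed during fine-tuning."""
    return n_examples * avg_tokens * epochs

tokens = training_tokens(n_examples=1_000, avg_tokens=200, epochs=3)
print(tokens)  # 600000

# Illustrative only: substitute the real per-million-token training
# rate for your model from the Azure pricing page.
rate_per_million = 3.00  # hypothetical USD rate
print(f"${tokens / 1_000_000 * rate_per_million:.2f}")
```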

Going Deeper

What Fine-Tuning Actually Does

A language model is a probability distribution over the next token. When you prompt GPT-4o with "The capital of France is", it assigns high probability to "Paris" and low probability to "pizza". These probabilities are determined by the model's weights, billions of numbers learned during pre-training.

Fine-tuning shifts these probability distributions. For our medical coding assistant, the base model might assign:

  • P("J20.9") = 0.001 (it's seen ICD codes, but rarely)
  • P("The patient has") = 0.15 (a more "natural" continuation)

After fine-tuning on hundreds of medical coding examples, the distribution shifts:

  • P("J20.9") = 0.85
  • P("The patient has") = 0.002

Before vs after fine-tuning: the probability distribution over next tokens shifts from favouring generic continuations to favouring domain-specific outputs.

The training objective is cross-entropy loss on the assistant tokens, the same maximum likelihood estimation objective used in pre-training, just applied to a much smaller dataset. The model learns to maximise the probability of producing exactly the outputs in your training data.

The gradients that update the weights flow through the same backpropagation algorithm used in pre-training. The difference is scope: pre-training processes trillions of tokens across the entire internet; fine-tuning processes thousands of tokens from your specific domain.
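A toy illustration of that objective: cross-entropy computed only over the assistant tokens, with prompt tokens masked out of the loss (the probabilities below are invented for the example):

```python
import math

def masked_cross_entropy(token_probs, loss_mask):
    """Mean negative log-likelihood over tokens where loss_mask is 1.

    token_probs: the model's probability for each target token
    loss_mask:   1 for assistant tokens, 0 for system/user tokens
    """
    losses = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(losses) / len(losses)

# System/user tokens (mask 0) are seen as context but not trained on;
# only the assistant tokens (mask 1) contribute to the loss.
probs = [0.9, 0.8, 0.7, 0.6, 0.5]
mask  = [0,   0,   1,   1,   1]
print(masked_cross_entropy(probs, mask))
```

Gradient descent then nudges the weights to push those assistant-token probabilities toward 1, which is exactly the distribution shift described above.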

LoRA: The Algorithm Under the Hood

Here's the problem with naive fine-tuning: GPT-4o has hundreds of billions of parameters. Updating all of them requires enormous GPU memory and risks catastrophically forgetting what the model already knows. This is where LoRA (Low-Rank Adaptation) comes in, and it's what Azure uses under the hood.

The key insight from Hu et al. (2021): when you fine-tune a large language model, the weight updates have low intrinsic rank. In plain English, the changes needed to adapt a model to a new task live in a much smaller subspace than the full parameter space.

Instead of updating a weight matrix $W \in \mathbb{R}^{d \times k}$ directly, LoRA decomposes the update into two smaller matrices:

LoRA decomposes the weight update into two small, trainable low-rank matrices B and A, leaving the original weights frozen.

$$W' = W + \Delta W = W + BA$$

Where:

  • $W$ is the original frozen weight matrix ($d \times k$)
  • $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable low-rank matrices
  • $r \ll \min(d, k)$ is the rank, typically 8, 16, or 64

The parameter reduction is dramatic. Consider a weight matrix in a large transformer:

  • Original: $d = 4096, k = 4096$ → 16.8 million parameters
  • LoRA with $r = 16$: $(4096 \times 16) + (16 \times 4096)$ = 131,072 parameters
  • Reduction: 99.2%

Across the entire model, LoRA typically trains only 0.1–1% of the original parameters. This means:

  1. Memory: You can fine-tune on a single GPU instead of a cluster.
  2. Speed: Fewer parameters to update means faster training.
  3. Storage: Each fine-tuned version is just the small $B$ and $A$ matrices, megabytes instead of gigabytes.
  4. No extra inference latency: At deployment, $BA$ is merged back into $W$. The final model has exactly the same architecture and speed as the original.

The initialisation matters too: $B$ is initialised to zero and $A$ to random Gaussian, so $\Delta W = BA = 0$ at the start. Training begins from the exact pre-trained model, with no disruption.
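The structure fits in a few lines of NumPy. This is a toy single layer to make the idea concrete, not Azure's implementation; the $\alpha / r$ scaling factor follows the paper's convention:

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer with a trainable low-rank update: W + (alpha/r) * B @ A."""

    def __init__(self, W: np.ndarray, r: int = 16, alpha: int = 16, seed: int = 0):
        d, k = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                 # frozen pre-trained weights
        self.A = rng.normal(0, 0.01, size=(r, k))  # random Gaussian init
        self.B = np.zeros((d, r))                  # zero init, so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x: np.ndarray) -> np.ndarray:
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

W = np.random.default_rng(1).normal(size=(64, 64))
layer = LoRALinear(W, r=8)
x = np.ones(64)
# Before any training, the layer is exactly the pre-trained one:
assert np.allclose(layer.forward(x), W @ x)
# Trainable params: B (64x8) + A (8x64) = 1,024, vs 4,096 frozen in W.
```

During training only `B` and `A` receive gradients; at deployment `W + scale * B @ A` can be computed once and stored, which is why merged LoRA adds no inference latency.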

The PEFT Family

LoRA belongs to a broader family called Parameter-Efficient Fine-Tuning (PEFT) methods. Here's how they compare:

| Method | Trainable Params | Memory | Quality | Inference Overhead |
| --- | --- | --- | --- | --- |
| Full fine-tuning | 100% | Very high | Best | None |
| LoRA | 0.1–1% | Low | Near-full | None (merged) |
| QLoRA | 0.1–1% | Very low | Good | Slight (quantisation) |
| Prefix tuning | Under 0.1% | Very low | Moderate | Slight (extra tokens) |
| Adapters | 1–5% | Low | Good | Slight (extra layers) |

LoRA is the most popular because it hits the sweet spot: near-full fine-tuning quality with no inference overhead. QLoRA adds 4-bit quantisation of the base model, reducing memory further. You can fine-tune a 65B parameter model on a single 48GB GPU. Prefix tuning prepends learnable "virtual tokens" to the input, but quality degrades for complex tasks. Adapters insert small trainable layers between existing transformer blocks.

The Moat Spectrum

Not all AI customisation is equal. Here's the full spectrum, from least to most defensible:

The moat spectrum: from prompt engineering (free, no moat) through RAG and fine-tuning to pre-training (very high cost, strong moat). Fine-tuning is the sweet spot.

| Approach | Cost | Setup Time | Moat Strength | When to Use |
| --- | --- | --- | --- | --- |
| Prompt engineering | Free | Minutes | None | Prototyping, one-off tasks |
| RAG (Retrieval-Augmented Generation) | $10s–100s/mo | Days | Weak (data can be copied) | Need current information, citations |
| Fine-tuning | $100s–1,000s | Days–weeks | Moderate (behaviour is learned) | Consistent formatting, domain tone, cost at scale |
| Pre-training | $10Ms–100Ms+ | Months | Strong (architecture + data) | You're OpenAI, Google, or Meta |

Prompt engineering is where most people stop. It works surprisingly well but offers zero competitive moat; anyone can copy your prompt. RAG adds your own data at inference time, which is powerful for knowledge-intensive tasks but the behaviour is still the base model's. Fine-tuning embeds behaviour into the weights. The model doesn't need to be told how to respond; it just does. Pre-training is building the engine from scratch, and unless you have a few hundred million dollars and a research lab, it's not your game.

When Fine-Tuning Beats Prompting

Fine-tuning wins over prompting when you need:

  • Consistent output formatting. JSON schemas, code conventions, structured reports. A fine-tuned model follows the format without lengthy system prompts.
  • Domain-specific behaviour. Medical coding, legal analysis, financial compliance. The model internalises domain norms.
  • Tone and style. Brand voice, technical writing style, conversational patterns.
  • Cost at scale. A fine-tuned model with a short prompt is cheaper per request than a base model with a 2,000-token system prompt.
  • Latency. Shorter prompts mean fewer input tokens to process.
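The cost-at-scale point is simple arithmetic. A sketch with illustrative placeholder prices (not current Azure rates, and ignoring the fine-tuned model's hosting cost):

```python
def monthly_input_cost(requests: int, prompt_tokens: int, rate_per_million: float) -> float:
    """Input-token cost for a month of traffic at a per-million-token rate."""
    return requests * prompt_tokens / 1_000_000 * rate_per_million

# Hypothetical: 1M requests/month at an illustrative $0.60 per million input tokens.
base  = monthly_input_cost(1_000_000, prompt_tokens=2_000 + 100, rate_per_million=0.60)
tuned = monthly_input_cost(1_000_000, prompt_tokens=100, rate_per_million=0.60)
print(base, tuned)  # the 2,000-token system prompt dominates the bill
```

At sufficient volume, the savings from dropping the long system prompt can exceed the one-off training cost within a month or two.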

When Fine-Tuning Loses to RAG

Fine-tuning embeds knowledge into weights, but weights are frozen after training. If the information changes frequently (stock prices, medical guidelines, product catalogues), RAG is the better choice. RAG retrieves current documents at inference time, so the model always has access to the latest information.

The best systems often combine both: fine-tune for behaviour (how to respond), RAG for knowledge (what to respond with).

Where This Comes From

LoRA: Low-Rank Adaptation of Large Language Models

LoRA was introduced by Hu et al. (2021) at Microsoft Research. The paper's central hypothesis is elegant:

"We hypothesize that the change in weights during model adaptation also has a low 'intrinsic rank,' which leads us to propose Low-Rank Adaptation (LoRA)."

The authors demonstrated that for GPT-3 175B, LoRA with rank 4 matched or exceeded full fine-tuning performance on multiple benchmarks while training only 0.01% of the parameters. They tested on natural language understanding (GLUE), natural language generation (E2E NLG), and instruction following. LoRA matched full fine-tuning across the board.

A key practical insight from the paper: LoRA is most effective when applied to the attention weight matrices ($W_Q$ and $W_V$), rather than the feed-forward layers. This is because attention matrices control the model's "routing" of information (which tokens attend to which), and task-specific behaviour is largely about changing these routing patterns.

ULMFiT: The Transfer Learning Paradigm

Before LoRA, there was ULMFiT. Howard & Ruder (2018) established the now-standard paradigm: pre-train on a large corpus, then fine-tune on your task. Their key innovations, discriminative fine-tuning (different learning rates per layer) and gradual unfreezing, are the conceptual ancestors of LoRA's approach.

The Broader Lineage

The idea that pre-trained representations can be adapted to new tasks has a long history:

  1. ImageNet transfer learning (2012–2014). Training on ImageNet, fine-tuning on medical images. Computer vision proved the concept.
  2. ULMFiT (2018). Brought transfer learning to NLP. Demonstrated that language model pre-training produces universal features.
  3. BERT (2018) and GPT (2018). Scaled the paradigm. Pre-train once, fine-tune for everything.
  4. LoRA (2021). Made fine-tuning efficient enough for massive models. You don't need to update every parameter.

Each step reduced the barrier. LoRA's contribution is making fine-tuning feasible for models so large that full fine-tuning would require a cluster.

Try It Yourself

  1. The format experiment. Take a task where you want structured output (e.g., JSON with specific fields). Compare: (a) a detailed system prompt describing the format, vs (b) a fine-tuned model trained on 50 examples of the correct format. Measure how often each produces valid output.
  2. Data quality vs quantity. Create two training sets for the same task: 50 carefully curated, high-quality examples vs 500 noisy, auto-generated examples. Fine-tune on each. Quality almost always wins. This is the moat.
  3. The moat test. Fine-tune a model on a specific domain task. Then try to replicate the same behaviour using only prompt engineering. How close can you get? Where does prompting fall short?
  4. LoRA from scratch. Implement a toy LoRA layer in PyTorch. Freeze a pre-trained GPT-2 model, add $BA$ matrices to the attention layers, and fine-tune on a small text classification task. Compare the parameter count to full fine-tuning.


Frequently Asked Questions

What is the difference between fine-tuning and prompt engineering?

Prompt engineering gives instructions to a base model at inference time, while fine-tuning embeds behaviour directly into the model's weights through additional training. Fine-tuning produces more consistent outputs without lengthy system prompts and can reduce per-request costs at scale. However, prompt engineering requires zero setup and is the right starting point for prototyping.

How much training data do I need for fine-tuning?

Azure OpenAI requires a minimum of 10 examples, but practical results typically need 50 to 500 high-quality examples depending on task complexity. Data quality matters far more than quantity: 50 carefully curated examples often outperform 500 noisy ones. Start small, evaluate, and add more data only if the model underperforms.

Does fine-tuning change the entire model?

No. Modern fine-tuning uses LoRA (Low-Rank Adaptation), which freezes the original model weights and trains only small low-rank matrices added to the attention layers. This typically updates only 0.1 to 1% of the original parameters, making fine-tuning feasible on modest hardware while preserving the base model's general capabilities.

Can I fine-tune open-source models instead of using Azure?

Yes. Open-source models like Llama and Mistral can be fine-tuned locally using libraries such as Hugging Face PEFT and QLoRA. The LoRA algorithm is the same regardless of platform. The trade-off is that you manage the infrastructure yourself, but you gain full control over the model and avoid ongoing API costs.

When should I use RAG instead of fine-tuning?

Use RAG (Retrieval-Augmented Generation) when the knowledge your model needs changes frequently, such as product catalogues, medical guidelines, or pricing data. Fine-tuning embeds knowledge into frozen weights, so it cannot adapt to new information without retraining. The best systems often combine both: fine-tune for consistent behaviour and formatting, then use RAG to inject up-to-date knowledge at inference time.

What is QLoRA and how does it differ from LoRA?

QLoRA combines LoRA with 4-bit quantisation of the base model, reducing memory requirements even further. With QLoRA, you can fine-tune a 65-billion parameter model on a single 48GB GPU. The trade-off is a slight quality reduction from quantisation and marginally higher inference latency compared to standard LoRA.
