
galian for Cursuri AI


Fine-Tuning LLMs in 2026: A Practical Guide for Engineers (LoRA, QLoRA, DPO, GRPO)

Fine-tuning has gone from "research lab toy" to a first-class production technique for AI engineers. With LoRA-class adapters, modern alignment algorithms (DPO, GRPO, RLVR), and serving stacks like vLLM, you can ship a custom model on a single H100 — sometimes on a single 4090.

But the question isn't can you fine-tune. It's: should you?

This guide is the engineering checklist I wish I'd had two years ago. It covers the decision tree, the modern toolchain, the gotchas, and the EU compliance constraints you can't ignore in 2026.

🇪🇺 Romanian / EU readers: the full hands-on Romanian-language program is at Fine-Tuning și Adaptarea Modelelor AI — Enterprise Edition. It includes a complete end-to-end project, EU AI Act governance, and FinOps modeling.


TL;DR

  • Don't fine-tune first. Try prompting → RAG → fine-tuning. In that order.
  • LoRA / QLoRA is the default in 2026. Full fine-tuning is rarely the right call.
  • Alignment ≠ SFT. SFT teaches format; DPO/GRPO/RLVR teach preferences and reasoning.
  • Evaluation is the hard part. Loss curves don't tell you if the model is better.
  • Serving matters. A great fine-tune served badly is just an expensive demo.
  • EU AI Act applies. Document your data, your evals, and your model card.

1. When fine-tuning is actually the right tool

Most teams reach for fine-tuning too early. Here's the honest decision tree:

Problem | First try | Fine-tune only if
Inconsistent output format | Prompting + structured outputs | Format breaks > 5% even with strict prompts
Knowledge cutoff / private data | RAG (Retrieval-Augmented Generation) | RAG retrieves the right chunks but the model still misuses them
Domain-specific style/voice | System prompt + few-shot | You need it baked in across thousands of calls (latency/cost)
Specialized reasoning (math, code, legal) | Better base model + CoT | You have a clean preference dataset and need stable behavior
Tool use / agents | MCP + good prompts | Tool-call accuracy is below your SLA after prompt iteration

Rule of thumb: if you can't articulate what your fine-tune teaches that a 200-line system prompt can't, you're not ready to fine-tune.

If you're earlier in the journey, the Prompt Engineering Masterclass and Advanced LLM Integration cover the cheaper alternatives in depth.


2. The 2026 technique landscape

Full fine-tuning

Updates every parameter. Maximum capacity, maximum cost, maximum risk of catastrophic forgetting. Justified for: foundational training, large domain shifts, or when you own the inference path and the dataset is huge (>1M high-quality examples).

LoRA (Low-Rank Adaptation)

The original LoRA paper (Hu et al., 2021) is still required reading. You freeze the base weights and train two small low-rank matrices A and B for each targeted weight matrix (typically the attention projections). A typical adapter is 0.1–1% of the model's parameters.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any causal LM works here; the model id is illustrative.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype="auto", device_map="auto"
)

lora_config = LoraConfig(
    r=16,                       # rank
    lora_alpha=32,              # scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# example output: trainable params: 8.4M || all params: 7.2B || trainable%: 0.12

QLoRA

QLoRA (Dettmers et al., 2023) loads the base model in 4-bit (NF4) and trains LoRA adapters on top. This is what lets you fine-tune a 70B model on a single 80GB GPU. Use bitsandbytes + HuggingFace PEFT.
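
A minimal QLoRA loading sketch, assuming bitsandbytes and PEFT are installed (the model id and exact quantization flags are illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",    # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)
# ...then attach the same LoraConfig as in the LoRA example above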

DoRA, OLoRA, rsLoRA

Newer variants that decouple magnitude and direction (DoRA), use orthogonal initialization (OLoRA), or rescale the update by the square root of the rank (rsLoRA). Gains are marginal in most cases: start with vanilla LoRA and only switch if you've measured a problem.
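
In HuggingFace PEFT these variants are mostly one-line switches on LoraConfig; a sketch (option availability and names depend on your PEFT version):

from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,                  # DoRA: magnitude/direction decomposition
    # use_rslora=True,              # rsLoRA: scale by alpha/sqrt(r) instead of alpha/r
    # init_lora_weights="olora",    # OLoRA: orthogonal initialization
    task_type="CAUSAL_LM",
)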


3. Alignment: SFT is just step one

Supervised Fine-Tuning (SFT) teaches the model what good output looks like. It does not teach preferences, refusals, or reasoning quality. That's what alignment is for.

DPO (Direct Preference Optimization)

DPO (Rafailov et al., 2023) replaces the RLHF pipeline (reward model + PPO) with a single classification-style loss on preference pairs. Simpler, more stable, and the de facto default in 2026.

from trl import DPOTrainer, DPOConfig

config = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,                   # KL regularization
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(
    model=sft_model,                    # the PEFT model from the SFT stage
    ref_model=None,                     # with a PEFT model, TRL derives the reference policy
    args=config,
    train_dataset=preference_dataset,   # rows with "prompt", "chosen", "rejected"
    processing_class=tokenizer,         # older TRL versions call this "tokenizer"
)
trainer.train()

GRPO and RLVR

GRPO (Group Relative Policy Optimization, popularized by DeepSeek-R1) and RLVR (RL with Verifiable Rewards) are the techniques behind the reasoning-model wave. If you're training for math, code, or anything with a programmatic verifier — these matter.

The HuggingFace TRL library now ships first-class support for SFT, DPO, GRPO, and KTO.
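
For reference, here is a GRPO sketch with a simple verifiable reward, loosely following TRL's documented pattern; the dataset, model id, and reward function are illustrative, and argument names may differ slightly across TRL versions:

from datasets import Dataset
from trl import GRPOTrainer, GRPOConfig

math_dataset = Dataset.from_list([
    {"prompt": "What is 17 * 23?", "answer": "391"},
    # ... thousands more examples with programmatically checkable answers
])

def exact_answer_reward(completions, answer, **kwargs):
    # Verifiable reward: 1.0 if the reference answer appears in the completion.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

config = GRPOConfig(
    output_dir="grpo-math",
    num_generations=8,              # completions sampled per prompt (the "group")
    max_completion_length=512,
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    reward_funcs=exact_answer_reward,
    args=config,
    train_dataset=math_dataset,
)
trainer.train()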


4. The data pipeline is the moat

A bad dataset will defeat a perfect training loop every time. Things that actually move metrics:

  1. Diversity over volume. 5K diverse examples beats 50K near-duplicates.
  2. Hard negatives. For preference data, pairs where chosen and rejected are almost equally good teach more than obvious wins.
  3. Decontamination. Strip eval-set leakage from training data. Always.
  4. Format consistency. Tokenize early to catch chat-template mismatches before you waste 10 GPU-hours (see the sketch after this list).
  5. PII and licensing. This is where the EU AI Act lives. Document provenance.
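
A quick way to catch template mismatches is to render one example through the exact chat template you will serve with; a minimal sketch (the model id and example content are illustrative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

example = [
    {"role": "user", "content": "Summarize clause 4.2."},
    {"role": "assistant", "content": "Clause 4.2 limits liability to..."},
]

# Render the example with the same template the serving stack will apply,
# then eyeball the special tokens and role markers before a full run.
print(tok.apply_chat_template(example, tokenize=False))
print(tok.apply_chat_template(example, tokenize=True)[:20])   # first token ids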

5. The 2026 tooling stack

Here's what a production-grade fine-tuning project looks like today:

Layer | Tool
Training framework | HuggingFace TRL
Adapters | HuggingFace PEFT
Quantization | bitsandbytes
Distributed | Accelerate / DeepSpeed ZeRO-3 / FSDP
Experiment tracking | Weights & Biases or MLflow
Serving | vLLM
Eval harness | lm-evaluation-harness + custom domain evals
Closed-source baseline | OpenAI fine-tuning for comparison

Wiring all of this into a real CI/CD lifecycle is what separates a notebook experiment from a deployable system. That's the focus of MLOps: Prototype to Production.


6. Evaluation: where most projects quietly fail

Loss curves go down. The model "feels better." You ship. Production complaints spike. Sound familiar?

Build a holistic eval suite before you start training:

  • Capability evals — domain-specific tasks scored by rubric.
  • Regression evals — verify the model didn't lose abilities (catastrophic forgetting is real).
  • Safety evals — refusals, jailbreak resistance, policy adherence.
  • LLM-as-judge — useful, but only if you correct for its biases with human spot-checks.
  • Cost & latency — TTFT, throughput, p95 — these are product metrics.

If your eval suite isn't version-controlled and reproducible, you don't have an eval suite. You have vibes.
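
As a concrete starting point for the regression piece, lm-evaluation-harness can score the base model and the fine-tuned adapter on the same frozen task list. A sketch, assuming the harness is installed and its HF backend supports the peft argument (model id, adapter path, and tasks are illustrative):

import lm_eval

for label, model_args in {
    "base":  "pretrained=meta-llama/Llama-3.1-8B-Instruct",
    "tuned": "pretrained=meta-llama/Llama-3.1-8B-Instruct,peft=./adapters/legal",
}.items():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=model_args,
        tasks=["gsm8k", "mmlu"],    # regression tasks the fine-tune must not degrade
        batch_size=8,
    )
    print(label, results["results"])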


7. Serving: the part nobody talks about until it breaks

LoRA adapters can be hot-swapped at inference time. vLLM, SGLang, and TensorRT-LLM all support multi-LoRA serving — meaning you can host one base model and dozens of fine-tuned adapters with near-zero overhead.

# vLLM with LoRA adapters
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules legal-adapter=./adapters/legal sales-adapter=./adapters/sales \
  --max-loras 4
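
Each request then selects an adapter by name through the OpenAI-compatible API; a sketch using the openai client (adapter names match the --lora-modules flags above, and the prompt is illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="legal-adapter",      # routes this request to the legal LoRA adapter
    messages=[{"role": "user", "content": "Summarize clause 4.2 of this NDA: ..."}],
)
print(resp.choices[0].message.content)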

This is the architectural unlock that makes fine-tuning economically viable for SaaS multi-tenancy.


8. EU AI Act: not optional in 2026

If you're shipping in the EU, fine-tuning a foundation model can put you in the deployer or provider category under the EU AI Act. Practical consequences:

  • Model card documenting training data, intended use, limitations.
  • Risk assessment if the use case touches Annex III (HR, education, critical infrastructure, law enforcement, etc.).
  • Logging of significant model updates and eval results.
  • Transparency obligations to end users for AI-generated content.

This isn't lawyer paranoia — auditors are already asking. Bake it into your pipeline from day one.


9. The mistakes I see most often

  1. Fine-tuning before exhausting prompting and RAG. Those alternatives are cheaper, faster, and easier to roll back.
  2. Using r=64 because "bigger is better". Most tasks saturate at r=8 to r=16. Measure.
  3. Mismatched chat template between training and inference. Silent quality killer.
  4. Training on the eval set. Decontaminate. Then decontaminate again.
  5. Skipping the SFT-only baseline. You can't claim DPO helped if you didn't measure SFT-only first.
  6. Ignoring catastrophic forgetting. Always run a regression eval against the base model.
  7. Forgetting the FinOps math. A $400 fine-tune that adds $0.002/request to inference costs an extra $2,000/day at 1M requests/day; the training bill is a rounding error next to the inference delta (run the numbers, as in the sketch below).
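
A back-of-the-envelope version of that check (every number here is an illustrative assumption, not a benchmark):

requests_per_day = 1_000_000
training_cost = 400.00       # one-off fine-tune cost
inference_delta = 0.002      # extra $/request vs. the baseline model
prompt_savings = 0.0005      # $/request saved by dropping long few-shot prompts

gross_delta = requests_per_day * inference_delta                     # $2,000/day
net_delta = requests_per_day * (inference_delta - prompt_savings)    # $1,500/day

print(f"One-off training cost:   ${training_cost:,.0f}")
print(f"Gross inference delta:   ${gross_delta:,.0f}/day")
print(f"Net delta after savings: ${net_delta:,.0f}/day")
# The one-off training cost is noise at this scale; the recurring inference
# delta is what decides whether the fine-tune is a win.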

Where to go next

If you want a structured path that goes from prompt engineering to deploying fine-tuned models in production:

Browse the full IT engineering track at cursuri-ai.ro/cursuri/it.


Closing thought

Fine-tuning in 2026 is no longer about can the model learn the task. It's about whether your dataset, eval suite, serving stack, and governance process are good enough to deserve a custom model. Get those right, and a single adapter can be the difference between a feature that costs you money and a feature that defines your product.

If this resonated, I'd love to hear what fine-tuning problem you're actually stuck on — drop it in the comments. 👇


Originally published on Cursuri-AI.ro — the AI engineering education platform for Romanian and EU professionals.
