DEV Community: finetuning

RAG vs Fine-Tuning: Which One Should You Actually Choose?

EncodeDots Technolabs — Mon, 29 Jun 2026 12:28:50 +0000

You wired up an LLM, pointed it at a user question about your product, and it confidently invented an API endpoint that doesn't exist. Welcome to the moment every AI engineer eventually hits: the base model is smart, but it doesn't know your company's data, documentation, or latest product changes.

There are two mainstream ways to fix that, RAG and fine-tuning, but most explanations stop at "it depends." This article goes further, breaking down when retrieval beats retraining, when fine-tuning is the better choice, and how to choose the right approach for real-world AI applications.

The one-line mental model

RAG = the model looks things up at inference time. Knowledge lives outside the model, in documents you control.

Fine-tuning = you bake new behavior into the weights via training. Behavior, and sometimes domain-specific patterns, are learned during training and live inside the model.

The single most useful question to disambiguate them:

Is my problem that the model doesn't know my facts, or that it doesn't behave the way I want?

Knowledge gap → RAG. Behavior gap → fine-tuning. Most "we need fine-tuning" requests turn out to be knowledge gaps that RAG can solve more efficiently.

RAG in code (runnable)

The whole RAG loop is: embed your docs → store the vectors → at query time, embed the question, find the nearest chunks (semantic search), stuff them into the prompt.

This example runs as-is. Embeddings use sentence-transformers (local, no API key); generation uses Claude. Swap the local embedding model for a hosted one (OpenAI, Voyage, or Cohere) by replacing the embedder.encode(...) calls.

pip install sentence-transformers numpy anthropic
export ANTHROPIC_API_KEY=sk-...   # for the generation step

import numpy as np
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast, local
client = Anthropic()  # reads ANTHROPIC_API_KEY from env

# 1. Offline: chunk + embed your knowledge base
docs = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and a 99.9% SLA.",
    "The API rate limit is 100 requests per minute on the Pro tier.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # (n_docs, dim)

def retrieve(query, k=2):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    sims = doc_vecs @ q                       # cosine sim (vectors are normalized)
    top = sims.argsort()[-k:][::-1]
    return [docs[i] for i in top]

# 2. Online: retrieve, then generate grounded in retrieved context
def answer(query):
    context = "\n".join(retrieve(query))
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "Answer using ONLY the context below. "
                "If it's not in the context, say you don't know.\n\n"
                f"Context:\n{context}\n\nQuestion: {query}"
            ),
        }],
    )
    return msg.content[0].text

print(answer("What's the rate limit on Pro?"))
# -> The API rate limit is 100 requests per minute on the Pro tier.

In production, you'd replace the in-memory NumPy search with a vector database such as pgvector, Qdrant, Weaviate, or Pinecone to keep retrieval fast as your knowledge base grows. The retrieval logic stays the same: you've simply replaced a linear search with an index built for scale.

Why engineers reach for RAG:

Update knowledge by changing a document - no retraining.
Answers are traceable; you know which chunk produced them.
Sensitive data stays in your store, not in model weights.

Fine-tuning in code

Fine-tuning doesn't retrieve anything. You train the model on hundreds or thousands of input→output examples until the desired behavior becomes consistent.

Most providers expect a JSONL file, where each line contains a complete training conversation.

{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "I was charged twice this month."}, {"role": "assistant", "content": "billing"}]}
{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "The export button does nothing on Safari."}, {"role": "assistant", "content": "bug"}]}
{"messages": [{"role": "system", "content": "Classify the support ticket into: billing, bug, feature_request, other."}, {"role": "user", "content": "Can you add dark mode?"}, {"role": "assistant", "content": "feature_request"}]}
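Before uploading, it can help to check that every line parses and ends with an assistant turn carrying a known label. The following is a minimal sketch, assuming the chat-style format shown above; the helper name `validate_jsonl_lines` and the sample lines are illustrative and not part of any provider's SDK.

```python
import json

# Labels taken from the system prompt in the training examples
ALLOWED_LABELS = {"billing", "bug", "feature_request", "other"}

def validate_jsonl_lines(lines):
    """Return (line_number, problem) pairs for malformed training examples."""
    problems = []
    for n, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((n, "missing 'messages' list"))
            continue
        last = messages[-1]
        if not isinstance(last, dict) or last.get("role") != "assistant":
            problems.append((n, "last message is not from the assistant"))
        elif last.get("content") not in ALLOWED_LABELS:
            problems.append((n, f"unexpected label: {last.get('content')!r}"))
    return problems

# Hypothetical sample: one good line, one bad label, one broken line
sample = [
    '{"messages": [{"role": "user", "content": "I was charged twice."}, {"role": "assistant", "content": "billing"}]}',
    '{"messages": [{"role": "user", "content": "Add dark mode?"}, {"role": "assistant", "content": "feature"}]}',
    'not json',
]
problems = validate_jsonl_lines(sample)
for line_number, problem in problems:
    print(f"line {line_number}: {problem}")
```

For a real file, pass `open("ticket_classifier.jsonl", encoding="utf-8")` instead of the sample list. Providers may enforce additional rules, so treat this as a pre-check only.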

Once the dataset is ready, you upload it and start a fine-tuning job. The exact SDK differs between providers, but the workflow looks like this:

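# Sketch in the shape of the OpenAI Python SDK; adapt to your provider's fine-tuning SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

# Upload the dataset first: the job takes the uploaded file's id, not a local path
training = client.files.create(file=open("ticket_classifier.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training.id,
    model="base-model-name",  # placeholder: use a model your provider allows for fine-tuning
    hyperparameters={"n_epochs": 3},
)
# poll job.status until "succeeded", then call the resulting model id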

Notice that the model never retrieves external documents at inference time. Everything it learned comes from the training examples you provided.

Why engineers reach for it:

Consistent formatting, tone, and behavior at scale without repeating detailed instructions in every prompt.
Specialized decision-making that's difficult to capture through retrieved documents alone.
Shorter prompts at inference because the desired behavior is learned during training, which can reduce token costs at high request volumes.

The catch: every time your requirements change, you need to update the training data and run another fine-tuning job. Data preparation is usually the biggest investment, not the training itself.

The decision checklist

Run down this list; stop at the first strong signal.

Knowledge changes often? → RAG. Retraining on every document update is masochism.
Need source-cited / auditable answers? → RAG. Fine-tuned weights can't tell you where an answer came from.
The model keeps getting the format, tone, or judgment wrong, not the facts? → Fine-tuning.
Have a few hundred clean, labeled examples? → Fine-tuning becomes a realistic option. If not, RAG is usually the faster place to start.
Still unsure? → Start with RAG. It's cheaper to build, easier to debug, and solves the most common problem: the model doesn't have access to your knowledge.

And here's the part most posts skip: they're not mutually exclusive. Mature AI systems often fine-tune for behavior and layer RAG on top for fresh knowledge. "RAG vs. Fine-Tuning" is increasingly becoming "RAG + Fine-Tuning."

Gotchas I've hit

Chunking quietly decides your accuracy. Bad chunk boundaries, like splitting tables mid-row or creating 2,000-token mega-chunks, hurt retrieval before the model ever sees the question.
RAG is not plug-and-play. Retrieve the wrong context, and the model will confidently produce the wrong answer.
Fine-tuning a knowledge problem is the classic expensive mistake. If the goal is simply to teach the model your latest pricing, fine-tuning is slow, costly, and goes stale. Use RAG.
No eval = no progress. Build a small labeled test set before launch. Without one, you're optimizing blind, and "it feels better" becomes your only metric.
Garbage in, confident garbage out. Both approaches amplify whatever you feed them. Clean the data first.
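To make the chunking gotcha concrete, here is a minimal sketch of a fixed-size chunker with overlap. It counts words rather than tokens and knows nothing about tables or headings, so it is a starting point under those assumptions, not a recommended production splitter. The function name and default sizes are illustrative.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into word windows that share `overlap` words with the previous window."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Tiny demo with 10 placeholder words and small window sizes
doc = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(doc, max_words=4, overlap=1)
for c in chunks:
    print(c)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbors. For structured content such as tables, splitting on the document's own structure first is usually the safer choice.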

RAG vs. Fine-Tuning at a Glance

RAG

Best for: Knowledge gaps
Knowledge update: Edit your documents, no retraining
Knowledge freshness: Always reflects your latest data
Traceable answers: Yes, via retrieved context
Upfront cost: Lower
Best first step: For most AI applications

Fine-Tuning

Best for: Behavior gaps
Knowledge update: Retrain the model
Knowledge freshness: Fixed until the next training run
Traceable answers: Not inherently
Upfront cost: Higher (data preparation is the biggest cost)
Best first step: Once behavior becomes the bottleneck

Conclusion

The "RAG vs Fine-Tuning" debate isn't a turf war - it's a routing decision. Point a knowledge problem at RAG, and a behavior problem at fine-tuning, and most of the confusion disappears.

For the vast majority of teams, the right first move is RAG: it's cheaper to build, far easier to debug, ships in days instead of weeks, and directly solves the most common failure mode - the model not knowing your stuff. Reach for fine-tuning when the model already has the facts but keeps getting the format, tone, or judgment wrong, and you've got a few hundred clean, labeled examples to teach it. When you've earned the complexity, run both: fine-tune for behavior, layer RAG on top for fresh, traceable facts.

The expensive mistake to avoid is reaching for fine-tuning to fix a knowledge gap. It's slow, it goes stale, and a retrieval layer would've done the job in a fraction of the time. Start simple, measure with a real eval set, and only add weight to the system when the evidence says you need it.

Best GPU for DreamBooth Training in 2026 (Ranked)

Thurmon Demich — Fri, 26 Jun 2026 01:14:26 +0000

This article was originally published on Best GPU for AI. The full version with interactive tools, FAQ, and live pricing is on the original site.

You found an art style you love, or maybe you want an AI that generates your face accurately. DreamBooth is how you get there -- but it is one of the most VRAM-hungry tasks in consumer AI. Inference is forgiving. Training is not.

Quick answer: The RTX 4090 (24GB, ~$1,600) is the best GPU for DreamBooth training. For SD 1.5 DreamBooth only, the RTX 4070 Ti Super (16GB, ~$700) works with optimizations.

Who this is for

You want to fine-tune Stable Diffusion or Flux models on your own images. DreamBooth creates a personalized model checkpoint that generates specific subjects -- faces, products, art styles, characters. Unlike LoRA, full DreamBooth training modifies the entire model and needs substantially more VRAM.

VRAM requirements for DreamBooth

DreamBooth Target	VRAM Needed	Training Time (1000 steps)	Minimum GPU
SD 1.5 (full fine-tune)	~14GB	~15 min	RTX 4060 Ti 16GB
SD 1.5 (with prior preservation)	~16GB	~25 min	RTX 4070 Ti Super
SDXL (full fine-tune)	~22GB	~45 min	RTX 4090
SDXL (with prior preservation)	~24GB	~60 min	RTX 4090
Flux DreamBooth	~26GB	~90 min	RTX 5090

These numbers assume FP16 training with gradient checkpointing enabled. Without gradient checkpointing, add 30-50% more VRAM.
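The 30-50% rule of thumb can be turned into a quick estimate. This is a rough sketch that simply applies the stated range to the table's figures; the function name is illustrative, and real usage depends on batch size, resolution, and optimizer.

```python
def vram_without_checkpointing(vram_with_gc_gb, low=0.30, high=0.50):
    """Return the (low, high) VRAM estimate in GB when gradient checkpointing is off."""
    return (vram_with_gc_gb * (1 + low), vram_with_gc_gb * (1 + high))

# Baselines copied from the table above (FP16, gradient checkpointing on)
baseline_gb = {
    "SD 1.5 (full fine-tune)": 14,
    "SDXL (full fine-tune)": 22,
    "Flux DreamBooth": 26,
}

for target, gb in baseline_gb.items():
    est_low, est_high = vram_without_checkpointing(gb)
    print(f"{target}: ~{gb}GB with checkpointing, roughly {est_low:.1f}-{est_high:.1f}GB without")
```

By this estimate, SD 1.5 moves from about 14GB to roughly 18-21GB without checkpointing, which no longer fits on a 16GB card.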


GPU comparison for DreamBooth

GPU	VRAM	SD 1.5 DB	SDXL DB	Flux DB	Price
RTX 5090	32GB	~8 min	~25 min	~55 min	~$2,000
RTX 4090	24GB	~12 min	~40 min	Tight	~$1,600
RTX 3090 (used)	24GB	~18 min	~55 min	Tight	~$800
RTX 5080	16GB	~14 min	Offload	No	~$1,000
RTX 4070 Ti Super	16GB	~18 min	Offload	No	~$700
RTX 4060 Ti 16GB	16GB	~28 min	Offload	No	~$400

Training times are for 1000 steps with gradient checkpointing and FP16. "Offload" means it technically works with model offloading but training becomes 3-5x slower.

Which GPU should you buy?

SD 1.5 DreamBooth only? The RTX 4070 Ti Super with 16GB handles it. Use gradient checkpointing and FP16. Training takes under 20 minutes per subject.
SDXL DreamBooth? You need 24GB. The RTX 4090 is the standard choice. A used RTX 3090 at ~$800 works too -- slower but the VRAM is there.
Flux DreamBooth? The RTX 5090 at 32GB is nearly mandatory. Flux's larger architecture pushes VRAM demands above what 24GB cards can handle comfortably.
Budget option? The RTX 4060 Ti 16GB can train SD 1.5 DreamBooth with aggressive optimization. Not fast, not comfortable, but functional.

Common mistakes to avoid

Skipping gradient checkpointing -- this single setting reduces VRAM usage by 30-40% at the cost of 15% slower training. Always enable it for DreamBooth. There is no reason not to.
Using too many training images -- DreamBooth works best with 15-30 high-quality images. Using 200 images wastes training time and does not improve results.
Training too many steps -- overtrained DreamBooth models produce distorted outputs. 800-1500 steps is usually the sweet spot for SD 1.5. SDXL needs fewer steps, not more.
Ignoring LoRA as an alternative -- if your GPU has less than 24GB, LoRA training achieves 80-90% of DreamBooth quality at a fraction of the VRAM cost. I use LoRA for most personal training now.

Final verdict

Training Target	Best GPU	Why
SD 1.5 DreamBooth	RTX 4070 Ti Super	16GB is enough
SDXL DreamBooth	RTX 4090	24GB needed
Flux DreamBooth	RTX 5090	32GB for comfort
Budget SD 1.5	RTX 4060 Ti 16GB	Affordable 16GB

For LoRA training specifically (the lighter alternative to DreamBooth), check the best GPU for fine-tuning guide. For broader Stable Diffusion GPU needs, see the best GPU for Stable Diffusion roundup. If you use Kohya_ss to manage your training scripts, see our best GPU for Kohya_ss guide for trainer-specific configuration.

DreamBooth is the one AI task where "more VRAM" is not just a nice-to-have but a hard requirement. Buy the most VRAM you can afford and use gradient checkpointing. Full stop.


Fine-Tuning AI Models Is No Longer Just for ML Engineers

Basavaraj SH — Thu, 25 Jun 2026 13:14:53 +0000

The gap between "using AI" and "owning AI" is closing fast - and understanding why matters for anyone building products or running a business today.

The Real Cost of Generic AI Models

Most people start their AI journey the same way: they pick up a general-purpose model, plug it into their workflow, and wait for magic. It works - sort of. The responses are decent, the outputs are readable, but something feels off. The model doesn't quite understand your industry's terminology. It misses the tone your brand needs. It gives confident-sounding answers that are just slightly wrong for your specific use case.

This is the limitation of off-the-shelf AI. These models are trained on broad internet data, which makes them impressively general but frustratingly imprecise. A legal tech startup and a fitness app both get the same baseline model, even though their needs couldn't be more different.

The solution has always been fine-tuning - taking a pre-trained model and training it further on your specific data so it learns your context, your language, and your goals. The problem? Until recently, fine-tuning required a dedicated ML engineering team, expensive GPU infrastructure, and weeks of iteration time. For a small business owner or a product manager without a technical background, that door was essentially closed.

What Fine-Tuning Actually Means - and Why It's Getting Easier

Think of a pre-trained language model like a very well-read generalist. It has absorbed enormous amounts of text and learned patterns in language, reasoning, and knowledge. Fine-tuning is like giving that generalist a focused apprenticeship in your specific domain. You show it examples of the kind of work you need, and it recalibrates.

What's changed recently is the tooling around this process. Frameworks are emerging that abstract away much of the technical complexity - handling things like memory optimization, hardware configuration, and training efficiency behind the scenes. The person running the fine-tuning no longer needs to understand every technical detail of what's happening under the hood, just as you don't need to understand how a car engine works to drive one.

One meaningful development in this space is the increasing compatibility between model repositories (where pre-trained models live) and training acceleration tools. When a model can move smoothly from a public library into a fine-tuning pipeline without extensive manual configuration, the barrier drops significantly. What once took a team and weeks can now be done faster, with fewer people, and with more reproducible results. That's not a small shift - it changes who gets to customize AI and for what purposes.

Real Example - Step by Step

Here's how a fine-tuning workflow would look in plain terms:

Step 1 - Collect your training data. Gather 200 to 500 examples of ideal customer interactions. These could be edited versions of real support tickets where your best agent gave the perfect answer. Format them as question-answer pairs.

Step 2 - Choose a base model. Pick a smaller, efficient model from a public repository that's close to your needs. You don't need the largest model available - smaller fine-tuned models often outperform large generic ones on specific tasks.

Step 3 - Run the fine-tuning. Using a modern training framework, you point the tool at your data and your chosen model. The framework handles memory management and optimization. You set a few parameters - how many training passes, the learning rate - often guided by sensible defaults.

Step 4 - Evaluate. Test the fine-tuned model against your original problem. Does it now correctly reference your 30-day return window? Does it match your tone? Compare its outputs against your baseline.

Step 5 - Deploy and monitor. Push the model into your support interface and track where it still struggles. Fine-tuning is iterative - your second round will be better than your first.

The whole process, with modern tooling, can happen in days rather than months.

How to Apply This Today

You don't need to run a fine-tuning job this week to start benefiting from this shift. Here's what you can do right now:

Audit your current AI pain points. Write down three specific cases where your AI tool gives you wrong, generic, or off-brand outputs. These are your fine-tuning candidates.

Start collecting training data now. Even if you're not ready to fine-tune yet, begin saving examples of ideal outputs - good customer emails, well-written product descriptions, accurate support responses. This library will be your fuel when you're ready.

Explore accessible platforms. Several platforms now offer fine-tuning workflows with user interfaces that don't require you to write code. Look for ones that support parameter-efficient methods, which are faster and cheaper than full model training.

Talk to your ML team (or find one). If you're a product manager or business owner, connect with someone technical who can run the training process while you own the data strategy and evaluation criteria. The collaboration model works well - you don't need to become an ML engineer, just a smart collaborator.

Set a specific success metric before you start. "Better outputs" isn't measurable. "Correctly answers our return policy question 90% of the time" is.
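A success metric like the one above can be checked with a few lines of code. The sketch below assumes a simple "answer contains the required phrase" check; `fake_model` is a hypothetical stand-in for a real model call, and the test cases are invented for illustration.

```python
def pass_rate(cases, model_fn):
    """Fraction of test cases whose answer contains every required phrase."""
    passed = 0
    for question, required_phrases in cases:
        answer = model_fn(question).lower()
        if all(phrase.lower() in answer for phrase in required_phrases):
            passed += 1
    return passed / len(cases)

def fake_model(question):
    # Hypothetical stand-in: a real project would call the fine-tuned model here
    return "You can return items within 30 days of delivery."

cases = [
    ("What is your return window?", ["30 days"]),
    ("Can I return something after two months?", ["30 days"]),
    ("Do you refund shipping costs?", ["shipping"]),
]

rate = pass_rate(cases, fake_model)
print(f"pass rate: {rate:.0%}, target met: {rate >= 0.90}")
```

Substring matching is crude, but it is enough to turn "better outputs" into a number you can compare before and after fine-tuning.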

Key Takeaways

Fine-tuning adapts a general AI model to your specific domain, data, and tone - making it meaningfully more useful than a generic baseline.
The biggest barrier to fine-tuning used to be technical complexity and cost; modern tooling is rapidly reducing both.
You don't need to be an ML engineer to lead a fine-tuning project - you need good data, clear success criteria, and the right collaborators.
Start collecting your "ideal output" examples now, even before you're ready to train anything.
Smaller fine-tuned models often outperform larger generic models on specific tasks - bigger isn't always better.

What's your experience with this? Drop a comment below - I read every one.

Sources referenced: Hugging Face Blog - Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel

Reinforcement Fine-Tuning with GRPO: Teach a Small Model to Reason

AI Tech Connect — Tue, 23 Jun 2026 13:30:09 +0000

Originally published on AI Tech Connect.

What you need to know

Reinforcement fine-tuning teaches a model to think, not to copy. Instead of imitating labelled examples the way supervised fine-tuning does, RFT hands the model a reward signal for a verifiable outcome and lets it discover its own strategy. It is the right tool when correctness is checkable: maths, code that must pass tests, structured extraction, tool-use that either works or does not.

GRPO is the algorithm everyone is using now. Group Relative Policy Optimization is a leaner PPO: it drops the separate critic model and instead samples a group of answers per prompt and scores each against the group's own average. That single change roughly halves the memory and makes RFT feasible on one GPU.

The reward function is the entire job. A good reward combines a verifier…

Read the full article on AI Tech Connect →
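The "score each answer against the group's own average" step in the excerpt can be sketched in a few lines. This follows the common GRPO formulation, which also divides by the group's standard deviation; the population standard deviation and the `eps` value used here are assumptions, and implementations differ on both.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sampled answer relative to its own group's mean reward."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, rewarded 1.0 if a verifier accepts them
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline is the group mean, no separate critic model is needed. If every answer in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.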

How to Align an LLM with DPO and ORPO: A Practical Guide

AI Tech Connect — Sat, 20 Jun 2026 13:30:19 +0000

Originally published on AI Tech Connect.

What this guide gives you

Most teams that fine-tune an open-weight model stop after supervised fine-tuning, then wonder why the model is technically correct but somehow off: too terse or too waffly, agreeable when it should push back, willing to answer things it should decline. The reason is structural. Supervised fine-tuning teaches the model to imitate one good completion per prompt. It never sees a worse completion, so it never learns that one answer is better than another. Preference tuning is the step that closes that gap, and in 2026 it is the standard practice for aligning open-weight models after SFT.

This is a recipe you can keep and reuse. The methods are stable: preference pairs of chosen and rejected responses, Direct Preference Optimisation (DPO) against a frozen reference,…

Read the full article on AI Tech Connect →
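For readers who want to see what "DPO against a frozen reference" computes, here is a minimal sketch of the standard DPO loss for a single preference pair. The inputs are summed sequence log-probabilities; the numbers in the demo and the `beta` value are made up for illustration, and this is not taken from the linked guide.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin difference))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp        # how much the policy upweights the chosen answer
    rejected_margin = policy_rejected_logp - ref_rejected_logp  # how much it upweights the rejected answer
    x = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(x)) = softplus(-x)
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy prefers the chosen answer more than the reference does: loss drops
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reference model's log-probabilities act as an anchor: the loss only falls when the policy widens the gap between chosen and rejected answers relative to where the reference already stood.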

RAG's Context Trap Forces Hypernetwork Agents Into View

XOOMAR — Fri, 19 Jun 2026 18:01:49 +0000

On June 19, 2026, VentureBeat put a sharp label on a problem enterprise AI teams already know: hypernetwork agents are emerging because fine-tuned models go stale, while RAG systems can lose the very context they were meant to supply.

That matters now because agent pilots keep hitting the same wall. A demo runs cleanly. Production stretches the task across policies, files, exceptions and approvals. Then a human starts feeding the agent more context, checking every answer and quietly doing the supervision the system was supposed to remove, according to VentureBeat.

When AI firm Chroma tested 18 leading models, “every one lost accuracy as its input grew.”

That finding is the technical hook. Longer context does not automatically make an agent safer. It can make the agent shakier.

June 19: why enterprise AI agents stall after the demo

The failure is not always orchestration. Routing, durable execution and observability help an agent coordinate work, but they assume the agent is competent enough to make good decisions as the job unfolds.

The deeper issue is where the company’s knowledge lives.

If the agent has to keep ingesting more business context as it works, the task gets heavier with every step. The prompt grows. Retrieval becomes more important. Missed details become harder to spot. The agent may still produce fluent output, but the employee is now watching the machine instead of doing higher-value work.

That is the autonomy ceiling. The agent performs the task, but the human still owns the risk.

For enterprises, this is not a philosophical problem. It affects whether an AI system can run a long audit, compliance check or risk workflow overnight and leave a person to validate the last 10%, rather than babysit the first 90%.

After Chroma’s 18-model test: why fine-tuning and RAG still need a human

Enterprises have mostly used two methods to teach models their business.

Fine-tuning puts knowledge into the model’s weights. That can improve performance on a specific task, but it brings a known weakness: catastrophic forgetting, identified in the 1980s and described in the source as still unresolved in 2026. Teach the model something new and it can erode what it already knew.

Teams often work around that by creating task-specific models or adapters. That helps isolation, but it also creates model sprawl. Governance gets harder. Costs rise. A fine-tuned model also becomes a snapshot. The day a policy changes, the retraining cycle starts again.

RAG and in-context learning take the other route. They place relevant documents and policies into the prompt at run time. That keeps knowledge fresher, but it shifts the risk to retrieval and context handling. A retrieval miss can look just like a correct answer. A detail buried in a long prompt can vanish from the model’s effective reasoning.

The failures rhyme:

Approach	Where it breaks	What the human sees
Fine-tuning	Stale policy or forgetting	A confident answer from old rules
RAG	Retrieval miss or context rot	A confident answer with missing context
Both combined	Partial mitigation, not certainty	More output that still needs checking

For teams managing model versions, adapters and evaluation artifacts, the governance problem touches the same MLOps concerns covered in XOOMAR’s guide to Open Source Model Registry Tools MLOps Teams Should Bet On. For knowledge-heavy AI systems, it also overlaps with the failure modes in Bad LLM Platforms Break Enterprise Knowledge Search.

ICML 2025 to SHINE 2026: how hypernetwork agents build specialists on demand

Hypernetwork agents try a third path. Instead of retraining one model or stuffing a giant prompt, a generator creates a small task-specific model adaptation at inference time.

A hypernetwork is a network whose output is the weights of another network. In this use case, it can generate an adapter from current business policies for a specific task.
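As a toy illustration of that definition, the sketch below uses a tiny linear "hypernetwork" to turn a task embedding into low-rank adapter factors for one frozen weight matrix. All sizes, names, and the random task embedding are assumptions for illustration; this is not how Text-to-LoRA, SHINE, or any vendor system is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
d_task, d_model, rank = 8, 16, 2  # illustrative sizes

# Hypernetwork parameters: one linear map per low-rank factor
W_a = rng.normal(scale=0.1, size=(d_task, rank * d_model))
W_b = rng.normal(scale=0.1, size=(d_task, d_model * rank))

def generate_adapter(task_embedding):
    """Map a task embedding to low-rank factors A (rank x d_model) and B (d_model x rank)."""
    A = (task_embedding @ W_a).reshape(rank, d_model)
    B = (task_embedding @ W_b).reshape(d_model, rank)
    return A, B

base_weight = rng.normal(size=(d_model, d_model))  # frozen layer of the target model
task_embedding = rng.normal(size=d_task)           # stand-in for an encoded policy description

A, B = generate_adapter(task_embedding)
adapted_weight = base_weight + B @ A               # base layer plus the generated low-rank update

print(A.shape, B.shape, adapted_weight.shape)
```

The point of the sketch is the data flow: the generator's output is weights, so a new task description yields a new adapter without a training run for that task.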

The concept was named in 2016, but applying it to specialist language models from text or documents is newer. VentureBeat points to Sakana AI’s Text-to-LoRA, presented at ICML 2025, which generates a model adapter from a plain-language description in a single pass. It also cites a 2026 system called SHINE, which frames hypernetwork adaptation as a promising frontier because it avoids some fine-tuning cost and prompting limits.

The model-zoo angle is the cleanest part. Enterprises already create per-task adapters to avoid interference between tasks. A hypernetwork turns those adapters into generated outputs instead of assets teams must train, store, update and govern one by one.

That does not remove governance. It changes what must be governed. The key artifact becomes the generator, the policy data it reads and the feedback loop that improves it.

Overnight compliance review: where a generated specialist could help

Consider a regulated company that wants an agent to review audit evidence overnight, map it against internal policies, flag gaps and prepare a report before staff arrive.

A fine-tuned model may know the workflow, but it may also be working from last quarter’s policy. A RAG agent can pull current documents, but it may miss a relevant policy or bury a crucial detail in a long prompt. A hypernetwork-generated model would, in theory, generate a narrow specialist from the current policy set for that specific review.

That matters economically if the job involves many agent steps. A 2025 paper by Nvidia researchers, cited by VentureBeat, says small models are capable enough for narrow, repetitive agent tasks and 10 to 30 times cheaper to run than frontier generalists.

Nace.AI is the commercial example in the source. The Palo Alto company raised a $21.5 million seed round in May. Its generator, called a MetaModel, produces parameter adaptations at inference time from company policies, targeting audit, compliance and risk assessment. The company markets a 90/10 split: agents handle the bulk of the workflow, while human experts validate the result.

Read that ratio carefully. It is not magic autonomy. It is a claim about reducing supervision by narrowing the model’s job and making review faster.

Peer review is the next test: where hypernetwork-built agents can break

The first weak point is calibration. The generated model must know when it is unsure. VentureBeat notes that recent work on generated adapters did not show automatic calibration gains over ordinary fine-tuning in every setting. Gains appeared only under specific constraints.

The second risk is data quality. If policies, procedures and examples are messy, the generated specialist inherits that mess. A hypernetwork cannot turn bad governance data into reliable judgment.

Scale is also unsettled. Published hypernetwork work has often been small. Nace says it has scaled its generator beyond published sizes and derived a scaling law for performance growth, with results being shared publicly and put through peer review. That paper is the one to watch.

Human review is another failure point. VentureBeat cites Deloitte Australia’s roughly A$440,000 government report, which shipped with fabricated citations and an invented court quote after senior review. The reviewers checked conclusions, not provenance. The EU AI Act’s Article 14 names the broader risk as automation bias.

A high-autonomy system compresses human attention into a late review step. That only works if every claim is grounded, cited and easy to verify.

Before a pilot: the four questions buyers should force vendors to answer

A buyer evaluating hypernetwork agents should start with architecture, not the headline autonomy ratio.

Ask:

Knowledge location: Does business knowledge live in model weights, prompts or generated adaptations?
Grounding: Does each output include citations, source passages and reasoning traces?
Escalation: What confidence thresholds, unsupported claims or policy gaps send work back to a human?
Ownership: When experts correct the agent, whose model improves, where does it run and does the asset stay inside the customer’s cloud?

The practical read is narrow. For long, repetitive, high-volume work where policies matter, hypernetwork-generated specialists deserve a pilot. For short tasks that finish in a few steps, the integration cost may buy little over a well-prompted frontier model.

The next decision point is evidence. Calibration and scale need validation beyond vendor claims. Until then, treat hypernetwork agents as the most credible new route past fine-tuning staleness and RAG context rot, but not as a replacement for provenance, review design and hard ownership terms.

Impact Analysis

Chroma’s test of 18 leading models found accuracy declined as input length grew.
Enterprise agent pilots can fail when demos become long workflows involving policies, files and approvals.
The key business goal is moving humans from supervising the first 90% of work to validating the last 10%.

Originally published on XOOMAR. For more news and analysis, visit XOOMAR.

Fine-Tuning Llama 3.2 3B on Medical QA: Week 4 - When Lower Loss Meant a Worse Model

Nicholas (Kosisochukwu) Ugbala — Tue, 16 Jun 2026 11:33:04 +0000

What Happened This Week

Week 3 produced a working fine-tuned model: one epoch, one dataset, a clear improvement over the base model. Week 4 was supposed to make it better with more data (a second dataset), two epochs, and a cleaner setup.

The eval loss dropped from 2.495 to 2.275. By that number alone, Week 4 was a success.

The model was worse.

This is the story of how a better loss number hid a serious regression, how I diagnosed it, and what it took to actually fix it. It is one of the most useful things I have learned in this project.

The Plan

Four changes over Week 3:

Combine two datasets: ChatDoctor (conversational patient-doctor QA) and MedAlpaca WikiDoc (encyclopedic clinical reference), for both conversational style and factual grounding.
Use Llama's built-in pad token instead of adding a custom one to avoid an oversized adapter file.
Train for two epochs on the full dataset instead of one.
Switch evaluation to greedy decoding for reproducibility.

The Pad Token Fix

In Week 3, I added a custom pad token and resized the model's embedding layer. This had an unintended cost: PEFT saved the entire resized embedding layer alongside the LoRA adapters, producing a 3.19GB adapter file instead of the expected ~50MB.

Llama 3.2's tokenizer already ships with a reserved padding token, <|finetune_right_pad_id|> (token 128004), made for exactly this purpose. Using it instead of adding a new token:

...
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.padding_side = "right"
# No add_special_tokens, no resize_token_embeddings

No embedding resize means no embedding layer saved with the adapters. The Week 4 adapter came out at ~50MB. What I Learned: Before adding a special token, check whether the model already has one. Llama 3.2 did.

Building the Combined Dataset

ChatDoctor alone produced a model that answered in a conversational manner but sometimes lacked factual grounding. WikiDoc is reference-grade encyclopedic medical content. The idea was that combining them would give both conversational style and factual grounding.

The first combine used 8,000 ChatDoctor and 4,000 WikiDoc, a 2:1 ratio. After cleaning and a 512-token length filter, this produced 10,255 rows: 9,229 train, 1,026 eval.

The cleaning itself was an exercise in diminishing returns.
ChatDoctor forum data carries platform filler ("Hello, welcome to Chat Doctor", "Hope this helps"), and the source has OCR-level corruption that breaks pattern matching ("HiT hanks" for "Hi. Thanks"). I built a two-pass regex cleaner plus a sentence-level trailing filler stripper that removes whole closing sentences containing filler keywords. It caught most of the noise. A small fraction of corruption resisted cleaning entirely, which I documented rather than chasing.
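A minimal sketch of that kind of cleaner, to make the two stages concrete. The patterns and the keyword list here are hypothetical; the post does not show the real ones, and a production cleaner would need many more variants.

```python
import re

# Hypothetical filler patterns; the real list used in the project is not shown.
LEADING_FILLER = re.compile(
    r"^\s*(hello|hi)[,.!]?\s*(and\s+)?welcome to chat\s*doctor[.!,]?\s*",
    re.IGNORECASE,
)
FILLER_KEYWORDS = ("hope this helps", "take care", "wish you good health")


def strip_leading_filler(text: str) -> str:
    """Remove a platform greeting from the start of an answer, if present."""
    return LEADING_FILLER.sub("", text, count=1)


def strip_trailing_filler(text: str) -> str:
    """Drop whole closing sentences that contain a filler keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    while sentences and any(k in sentences[-1].lower() for k in FILLER_KEYWORDS):
        sentences.pop()
    return " ".join(sentences)


def clean_answer(text: str) -> str:
    return strip_trailing_filler(strip_leading_filler(text))


raw = ("Hello, welcome to Chat Doctor. Iron deficiency is usually treated "
       "with oral iron. Hope this helps. Take care.")
print(clean_answer(raw))
```

Working at the sentence level for the trailing pass means a closing line is either kept whole or dropped whole, which avoids leaving half-sentences behind.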

The Training Run That Looked Like Success

Two epochs, 1,154 steps, about four hours on a Kaggle T4.

Step 150:   train 2.499  |  eval 2.474
Step 300:   train 2.336  |  eval 2.340
Step 600:   train 2.282  |  eval 2.297
Step 900:   train 2.241  |  eval 2.279
Step 1050:  train 2.231  |  eval 2.275

A clean, healthy loss curve. Eval loss dropped steadily to 2.275, well below Week 3's 2.495. Train and eval tracked each other closely, indicating no classic overfitting. Mean token accuracy rose to 0.515.

Every number said the model improved.

The Regression

Then I ran the five test questions with greedy decoding.

The diabetes answer began correctly, then collapsed:

Eye yawning
Eye yawns
Eye years
Eye yolks
Eye yummy
Eye yogurt

A complete generation breakdown, under greedy decoding, which is supposed to be the stable option. The heart attack answer produced a runaway list that drifted from cardiac symptoms into sore throats and ear pain. Hypertension confidently recommended atenolol as first-line therapy, which is wrong: beta-blockers are not first-line for uncomplicated hypertension.

The model with the better loss number produced worse answers than Week 3's model.

Diagnosing It

Two things were happening, and separating them mattered.

First, the repetition penalty backfired. I had set no_repeat_ngram_size=3, which forbids repeating any three-token sequence. Once it generates a three-token phrase like "consult your doctor", it can never produce that exact phrase again in the same answer. The intent is to stop repetition loops. The effect was the opposite: when the model wanted to end a list by repeating a natural closing pattern, the rule forbade it, forcing a brand-new token every time. The only way to keep producing non-repeating tokens was to drift into nonsense: "Eye yummy, Eye yogurt." The setting meant to prevent loops was driving the degeneration.
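The rule is easier to reason about in code. The sketch below is a plain-Python illustration of what a no-repeat-n-gram constraint forbids at each step; it uses words instead of token IDs for readability and is not the library's actual implementation.

```python
def banned_next_tokens(generated, n=3):
    """Tokens that would complete an n-gram already present in `generated`.

    Assumes n >= 2. With n=3, if the last two tokens match the first two
    tokens of any earlier trigram, that trigram's third token is banned.
    """
    if len(generated) < n - 1:
        return set()
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


tokens = ["consult", "your", "doctor", ".", "please", "consult", "your"]
print(banned_next_tokens(tokens))
```

In a long list where every item starts the same way, the banned set grows with each item, so the model is pushed toward continuations it has not used yet, however unlikely they are.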

Second, and more fundamental: the model had overfit to list generation. The combined dataset, especially the WikiDoc half, contained many list-formatted answers. Two epochs reinforced a pattern: when answering, produce a list and keep extending it. On questions with naturally bounded answers (a mechanism or short cause), the model stayed controlled. On questions inviting enumeration (drugs, symptoms), it started a list and could not stop, eventually confabulating list items: invented drugs like "artuzofloxacin", invented symptoms.

The loss curve never showed this because loss measures next-token prediction accuracy on the eval set. A model can be better at predicting the next token while getting worse at producing a coherent, bounded, truthful answer.
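One way to see how little the headline number says: converting the two eval losses to perplexity. This sketch assumes the reported loss is mean cross-entropy in nats, which is what common training loops log; the post does not state it explicitly.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)


week3 = perplexity(2.495)
week4 = perplexity(2.275)
print(f"Week 3 eval perplexity: {week3:.2f}")
print(f"Week 4 eval perplexity: {week4:.2f}")
```

Under that assumption perplexity falls from roughly 12.1 to roughly 9.7. That is an average over reference tokens the model is shown one at a time; it says nothing about what happens when the model has to continue from its own output for 200 tokens.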

The Fix

Three changes, applied together.

Rebalanced the data. Dropped WikiDoc from 4,000 to 1,500 and raised ChatDoctor to 8,500, roughly 85% narrative prose, 15% encyclopedic. ChatDoctor's conversational answers train the model toward bounded, flowing responses rather than open-ended lists. This attacked the root behaviour.

Expanded the LoRA target modules. Those module names need a short explanation. Each layer of the model has two parts that do different jobs. The attention layers (q_proj, k_proj, v_proj, o_proj) decide what to pay attention to: how tokens relate to each other, how "the patient" connects to "their symptoms" later in the same question. The feed-forward layers (up_proj, down_proj, gate_proj) are where factual knowledge tends to be stored and retrieved; research shows they behave somewhat like a key-value memory, where a concept goes in and the associated facts come out.

Week 3 applied LoRA only to the attention layers, leaving the feed-forward layers frozen. That tuned how the model routes information but left the layers that hold the facts untouched. The confabulation, inventing drugs like "artuzofloxacin", was a factual recall failure: the model could not keep real drug names active while generating a list. So in Week 4, I added the feed-forward layers to the LoRA targets, letting fine-tuning adjust the part of the model where the facts live, not just the attention routing. (The claim that facts live in the feed-forward layers is a simplification; knowledge is distributed across the whole model. But as a reason to target those layers for a recall problem, it holds.)
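To get a feel for what adding the feed-forward projections costs, here is a back-of-the-envelope count of trainable LoRA parameters. The layer shapes are assumed Llama 3.2 3B dimensions and the rank is hypothetical; neither is stated in the post, so treat the totals as illustrative.

```python
# Assumed Llama 3.2 3B shapes as (in_features, out_features); not from the post.
ATTENTION = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
}
FEED_FORWARD = {
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28  # assumed
RANK = 16        # hypothetical LoRA rank


def lora_params(shapes, rank, num_layers):
    """Each adapted matrix adds rank * (in_features + out_features) weights."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


attention_only = lora_params(ATTENTION, RANK, NUM_LAYERS)
all_linear = lora_params({**ATTENTION, **FEED_FORWARD}, RANK, NUM_LAYERS)

# The combined name list is what would be passed as target_modules in a LoRA config.
target_modules = list(ATTENTION) + list(FEED_FORWARD)

print(f"attention only: {attention_only:,} trainable parameters")
print(f"attention + MLP: {all_linear:,} trainable parameters")
print(target_modules)
```

Under these assumptions the feed-forward projections account for well over half of the adapter, because their matrices are much wider than the attention projections.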

Fixed generation. Removed no_repeat_ngram_size entirely. Set eos_token_id explicitly to <|eot_id|> so the model can actually stop. Used repetition_penalty=1.3 to discourage loops without a hard ngram ban, and capped max_new_tokens at 256.

outputs = model.generate(
    input_ids=encoded_inputs["input_ids"],
    attention_mask=encoded_inputs["attention_mask"],
    max_new_tokens=256,
    do_sample=False,
    repetition_penalty=1.3,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    pad_token_id=tokenizer.pad_token_id,
)

One epoch on the rebalanced data.

The Result

The degeneration was gone. Hypertension named four real antihypertensive drug classes (ACE inhibitors, ARBs, beta-blockers, calcium channel blockers) and stopped. Malaria named real treatments (artemether-lumefantrine, chloroquine, mefloquine) and stopped. The diabetes and iron deficiency answers stayed accurate. The heart attack answer, which had failed in every previous run, finally produced seven correct cardiac warning signs and stopped.

And running the same question twice produced byte-for-byte identical output. Greedy decoding made the results reproducible, which is what makes the claims defensible. In Week 3, the same question could give a good answer one run and collapse the next. Now the model's behaviour is consistent and verifiable.

What I Actually Learned

Lower loss is not a better model. Eval loss measures next-token prediction. It does not measure factual accuracy, coherence, or whether the model knows when to stop. The Week 4 two-epoch model had the best loss and the worst generation. I would not have caught this if I had assumed that lower loss means a better model and skipped manually testing the model's output.

Generation settings are not an afterthought. The same weights produced a total collapse or a clean answer depending on the decoding configuration. A repetition penalty meant to help actively caused the degeneration. Half of the battle with a small model is how you decode it.

Small models have a ceiling. A 3B fine-tuned on consumer hardware handles clinical QA well but struggles to enumerate without confabulating. Rebalancing the data and expanding LoRA targets pushed that ceiling up, but it is a real limit. In production, the answer would be a larger base model. Naming the constraint honestly makes it clearer which improvements are worth pursuing next.

Reproducibility is a feature you build in. Greedy decoding, a fixed seed, a pinned data sample. Without these, "the model does X" is not a claim I can stand behind.

Where the Model Lives

nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned-v2

Adapter file: ~50MB, the pad token fix working as intended.

What's Next

Week 5 wraps the model in a FastAPI inference endpoint, containerises it with Docker, and deploys it to a public URL anyone can call. The generation settings worked out this week become the server's defaults.

Model: huggingface.co/nicholas-ugbala-hf/llama-3.2-3b-medical-finetuned-v2
Dataset: huggingface.co/datasets/nicholas-ugbala-hf/medical-qa-narrative-10k
Repo: github.com/nicholas-ugbala-dev/healthcare-llm-finetune

LoRA and QLoRA from Scratch: An Eval-Driven Fine-Tuning Recipe

AI Tech Connect — Tue, 16 Jun 2026 11:30:16 +0000

Originally published on AI Tech Connect.

What this recipe gives you

Fine-tuning has a reputation as a dark art reserved for teams with eight-figure GPU budgets. It is not. With LoRA and QLoRA you can adapt a 7B or 8B open-weight model on a single consumer-grade card, in an afternoon, for the price of a couple of cups of coffee. The hard part was never the compute. The hard part is knowing whether you should fine-tune at all, building a dataset that matches how you will actually call the model, and proving the result is better rather than merely different. This is a recipe you can keep and reuse across models and tasks. The methods here are stable: low-rank adaptation, 4-bit quantisation, an 80/10/10 split, an eval set built first. The specific model you point it at (Llama, Qwen, Mistral, Gemma) barely changes the steps. Here…

Read the full article on AI Tech Connect →

Build Small Hackathon - Quillwright

AaryaP — Mon, 15 Jun 2026 22:59:04 +0000

How Quillwright turns a photo and a voice note into a tradesperson's estimate, with an orchestra of small models, on your own machine, and not a single number invented by an LLM.

The job nobody wants

Every tradesperson does the same unpaid hour after the real work is done: writing up the estimate. Parts, quantities, labor, a defensible total. Quillwright is an on-device, human-supervised agent that does that draft from a field capture (a job photo plus a spoken note) and hands back an itemized, editable estimate.

The constraints we set ourselves were the interesting part: small models (≤32B), no third-party AI APIs, and a hard rule that no customer-facing number ever comes from a language model. Those three constraints shaped every decision below.

An orchestra, not a soloist

There's no single model doing the work. Each role in the pipeline resolves to a small, purpose-fit model:

Perception: MiniCPM-V (OpenBMB) reads the job photo into observations ("RUN CAPACITOR", a nameplate model number).
Agent Brain: NVIDIA Nemotron-3-Nano drives a narrow tool-calling loop: which items, what quantities, when it's done.
Audio: Cohere Transcribe turns the voice note into text on-device.
Multilingual: Cohere Aya translates the customer-facing copy (Spanish, French, Mandarin), descriptions only, never the numbers.
Embedding: a small embedder powers semantic recall of similar past jobs.

The brain's tool surface is deliberately tiny: essentially add a priced item and finish. That narrowness is why a 4B model is reliable here: it does routing and judgment, not arithmetic.

Facts-from-Tools: the rule that runs through everything

The correctness rule is simple to state and ruthless to enforce: any number that reaches the customer (price, quantity, tax, total) comes from a tool (a catalog lookup, a deterministic compute) or from a human edit. Never from the model's free generation.

It holds in the obvious places (the brain calls lookup_price, not "I think this costs $40") and the non-obvious ones:

Edits re-run through a server-authoritative recalc. The browser never computes its own total.
Translation changes words, not digits.
Document Capture (reading a supplier quote) produces Proposed Line Items: the document is the source, but a price only becomes customer-facing once a human confirms it.
The refinement chat keeps a sanitized history: when you reopen an estimate and keep editing, the model sees what you asked ("make it 2 hours") but takes the numbers from the current line items, so a stale dollar figure can never leak back in. Even the conversation's own compaction is done in code, not by asking a model to summarize.
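A minimal sketch of what a server-authoritative recalc can look like under this rule. Only lookup_price is named in the write-up; the catalog contents, the tax rate, and the recalc function are hypothetical stand-ins for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical catalog and tax rate; real values would come from the trade catalog.
CATALOG = {"run_capacitor": Decimal("38.50"), "labor_hour": Decimal("95.00")}
TAX_RATE = Decimal("0.08")


def lookup_price(sku: str) -> Decimal:
    """Tool call: the only place a unit price is allowed to come from."""
    return CATALOG[sku]


def recalc(line_items):
    """Recompute every customer-facing number from (sku, quantity) pairs.

    The model (or a human edit) chooses the items and quantities;
    all arithmetic happens here, in deterministic code.
    """
    cents = Decimal("0.01")
    subtotal = sum((lookup_price(sku) * qty for sku, qty in line_items), Decimal("0"))
    tax = (subtotal * TAX_RATE).quantize(cents, rounding=ROUND_HALF_UP)
    return {"subtotal": subtotal, "tax": tax, "total": subtotal + tax}


estimate = recalc([("run_capacitor", 1), ("labor_hour", 2)])
print(estimate)
```

Because the total is always derived from the current line items, an edit such as "make it 2 hours" only has to change a quantity; no dollar figure from an earlier turn needs to be trusted.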

The eval story (the part I'd tell another builder)

Here's the moment that changed how we built this. We ran the agent by hand on a handful of jobs and it looked perfect. Then we wrote an eval set and scored it.

Item F1: 0.367.

Manual testing had been lying to us: we'd unconsciously fed it the cases it handled. The eval set didn't. Two fixes, both measured:

Fuzzy catalog lookup: "refrigerant" should find refrigerant_r410a. F1 jumped to 0.880.
Prompt tuning the brain's tool-calling, to 0.967, with quantity accuracy going from 0.40 to 1.00.
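The first fix can be approximated with the standard library. This is a sketch under assumptions: the catalog keys other than refrigerant_r410a are invented, and the real lookup may use a different matching strategy.

```python
import difflib

# Hypothetical catalog keys; only refrigerant_r410a appears in the write-up.
CATALOG_KEYS = ["refrigerant_r410a", "run_capacitor", "contactor_30a", "labor_hour"]


def fuzzy_lookup(query, keys=CATALOG_KEYS, cutoff=0.6):
    """Resolve a loose item name to a catalog key, or None if nothing is close."""
    q = query.strip().lower().replace(" ", "_")
    # An unambiguous prefix match wins outright.
    prefix_hits = [k for k in keys if k.startswith(q)]
    if len(prefix_hits) == 1:
        return prefix_hits[0]
    # Otherwise fall back to similarity matching.
    close = difflib.get_close_matches(q, keys, n=1, cutoff=cutoff)
    return close[0] if close else None


print(fuzzy_lookup("refrigerant"))
print(fuzzy_lookup("run capacitor"))
print(fuzzy_lookup("zzz"))
```

Returning None instead of a best guess keeps the lookup honest: an unmatched item can be surfaced for a human to resolve rather than silently priced as something else.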

The lesson isn't "we got a good number." It's that the good number only existed because we were willing to be told a bad one first.
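For readers who have not scored item extraction before, a set-based F1 per job can be sketched as follows. How the repo's eval scripts compute it is not shown here, so this is one plausible definition rather than the project's exact metric.

```python
def item_f1(predicted, expected):
    """F1 between the set of predicted item keys and the expected set."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        return 1.0  # nothing expected, nothing predicted
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# One correct item, one spurious item, one missed item.
print(item_f1(["run_capacitor", "contactor_30a"], ["run_capacitor", "refrigerant_r410a"]))
```

Averaging this score over a fixed set of jobs gives a single number that moves when a change helps or hurts, which is what hand-picked manual checks cannot do.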

Memory that gets smarter, measured the same way

Quillwright recalls similar past jobs to inform a new estimate. The first version used keyword matching. We measured recall@1 = 0.750. Swapping in a small embedder for a semantic re-rank moved it to 0.875, with one honest remaining miss we left in, because a benchmark with no failures is a benchmark you don't trust.
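Recall@k is simple enough to sketch in a few lines. The data layout below (one ranked list of job IDs per query, one correct job per query) is an assumption for illustration.

```python
def recall_at_k(ranked_results, relevant, k=1):
    """Fraction of queries whose correct past job appears in the top k results."""
    hits = sum(
        1 for ranked, target in zip(ranked_results, relevant) if target in ranked[:k]
    )
    return hits / len(relevant)


# Hypothetical job IDs: the first query is answered at rank 1, the second at rank 2.
ranked = [["job_12", "job_07"], ["job_03", "job_21"]]
truth = ["job_12", "job_21"]
print(recall_at_k(ranked, truth, k=1))
print(recall_at_k(ranked, truth, k=2))
```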

Fine-tuning a small vision model on receipts, and on the real domain

The 🎯 artifact is a MiniCPM-V LoRA fine-tune. On the public CORD receipt benchmark, the tune lifted item F1 from 0.588 → 0.681 (+0.09). But CORD is receipts, not trade invoices, so we also generated a grounded-synthetic set of trade invoices (built from a real 381-entry trade catalog) and fine-tuned on that. In-distribution, the tune went from 0.703 → 0.933 (+0.23), with price accuracy hitting 1.00.

The +0.23 is the honest headline: a small model, fine-tuned on the actual domain, closes most of the gap to a clean read. The +0.09 on CORD is the conservative one: it's a harder, out-of-domain benchmark, and we report it anyway.

Artifacts

Both LoRA adapters are on the Hub, and every number above is reproducible from the eval scripts in the repo:

🎯 Aarya2004/minicpmv-trade-lora: the in-domain trade-invoice tune (0.703 → 0.933).
Aarya2004/minicpmv-cord-lora: the conservative CORD baseline (0.588 → 0.681).

Metric	Before	After
Agent Brain item F1	0.367	0.967
Episodic recall@1	0.750	0.875
MiniCPM-V item F1 (trade, in-domain)	0.703	0.933
MiniCPM-V item F1 (CORD, OOD)	0.588	0.681

"On your own machine", and the honesty around it

The hero claim is no cloud. The honest version of that claim has two parts:

The Private Stack is open small models with no third-party AI APIs. Locally, those models genuinely run on the dev machine via Ollama / llama.cpp, and we filmed an Airplane-Mode Proof: Wi-Fi off, a real forge completing.
The hosted demo Space is wired live to Modal GPUs, the Best Stack: a Nemotron-3-Nano 30B brain, Nemotron-Omni for vision and audio, Aya-Expanse for multilingual. It's the same agent loop and the same Facts-from-Tools guarantees as the local run, just with more headroom; the apps scale to zero when idle, so the Space can fall back to a lightweight CPU mode (and says so on the page) when the models aren't wired. The local Private Stack and the hosted Best Stack are the same family at two tiers: flip one env var and the brain moves from a 4B on a laptop to a 30B on a GPU without touching the agent code.

Same agent, same tools, same Facts-from-Tools guarantee. Only the models behind each role change:

Role	🔒 Private Stack (local)	⚡ Best Stack (Modal)
Brain	Nemotron-3-Nano 4B (NVIDIA)	Nemotron-3-Nano 30B (NVIDIA)
Perception	MiniCPM-V (OpenBMB)	Nemotron-Omni 30B (NVIDIA)
Audio	Cohere Transcribe (on-device)	Nemotron-Omni 30B (NVIDIA)
Multilingual	Aya (Cohere)	Aya-Expanse 8B (Cohere)
Embedding	on-device (sentence-transformers)	same on-device path
Extraction	no local path	Parse extractor (fine-tuned)
Runs offline?	✅ Yes, Airplane-Mode Proof	❌ No, hosted GPU endpoints
Cost / GPU	$0, your hardware	scales to zero when idle

We hold the same line everywhere a feature could over-claim. The "Finalize & Send" feature really texts or emails the estimate on the local path with your own provider creds; on the public Space it drafts only and tells you nothing was transmitted. Same for the phone call and the phone-capture QR: real on the tunneled local machine, honestly framed.

Three ways in

Once the core was solid, the capture surface grew. Each path lands in the same pipeline and the same Facts-from-Tools guarantees:

The Workspace: type/paste a note, add a photo, watch the Digital Apprentice stream.
Call a phone number: describe the job out loud; it transcribes the call, forges a draft estimate, reads the total back, and texts you the PDF. A human approves later.
Scan a QR: capture a photo and voice note on your phone; the desktop forges it live on screen.

What I'd carry to the next project

Write the eval before you trust the demo. 0.37 was the most useful number in the whole build.
Keep the model's job small. The brain is reliable because it never touches arithmetic.
Make the honesty structural, not aspirational. "The model never emits a number" is a code path, not a promise, and it's the same code path on every capture surface.

Quillwright: tell it about the job; it drafts the estimate.

Turning Gemma 4 into an Old Korean Translator

bebechien — Mon, 15 Jun 2026 06:13:07 +0000

There’s something uniquely beautiful about old books. The smell of weathered paper, the texture of the pages, and the stories that have survived generations. But if you’ve ever tried opening a piece of Classical Korean literature—like the Joseon Dynasty novel HongGildongJeon (홍길동전)—you’ll quickly realize that time leaves its own mark on language.

Between the lack of word spacing and obsolete letters like the dot vowel Arae-a (ㆍ) or the soft Yeorin-hieut (ㆆ), reading it feels less like browsing a novel and more like solving a beautiful, ancient puzzle. Even for native speakers, the linguistic gap is massive.

So I decided to create this tutorial as a digital bridge between the past and the present. Using Gemma 4 E2B (IT), I set out to create a humble translator that turns Classical Korean into smooth, modern Korean.

The Recipe for Training

To keep things manageable, I ran this on a single NVIDIA T4 GPU (16GB) using Google Colab.

1. Setting Up the Kitchen

First, we pull in our favorite open-source tools: Hugging Face’s transformers, trl for the training loop, and peft so we can use LoRA (Low-Rank Adaptation) to fine-tune our model without needing a massive server cluster.

2. Gathering the Ingredients

For our data, I used a public domain version of HongGildongJeon, paired with a beautiful modern translation by 직지프로 (licensed under Creative Commons).

To make Gemma feel at home, I structured the data into a conversation, guiding the model with a clear system prompt:

[
  {"role": "system", "content": "Translate Classical Korean into Modern Korean."},
  {"role": "user", "content": "됴션국셰둉ᄃᆡ왕즉위십오연의홍희문밧긔ᄒᆞᆫᄌᆡ상이잇스되"},
  {"role": "assistant", "content": "조선국 세종대왕 즉위 십오년에 홍회문 밖에 한 재상이 있으되,"}
]

(Translation note: This line introduces us to a prime minister living just outside the Honghoemun Gate during the 15th year of King Sejong's reign!)

The "Before" Picture

Before giving Gemma any specific training, I ran a quick baseline test. Base models are smart, but archaic grammar is a highly specific domain. Without tuning, Gemma tried its best but ended up giving long, overly literal explanations:

Original Classical Text: ᄇᆡᆨ씨듯고ᄂᆡ심의탄복왈그근본을ᄀᆞᆷ초지아니ᄒᆞ니장부로다ᄒᆞ고ᄌᆡ삼위로ᄒᆞ더라
Human Translation: 백씨 듣고 내심에 탄복 왈, "그 근본을 감추지 아니하니 장부로다!" 하고, 재삼 위로하더라.
Gemma's Initial Guess: "Like the color, the heart's praise said, 'The foundation cannot be deeply felt...'"
Initial Similarity Score: 4.85% 💔

(Translation note: This line actually means - Upon hearing this, Mr. Baek was deeply impressed and said, "He does not hide his true nature; he is a true man!" and comforted him again and again.)

The base model was clearly lost in time. It needed a map.

Teaching Gemma with Care

To train the model efficiently, I used a Parameter-Efficient Fine-Tuning (PEFT) setup with LoRA.

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

The Secret Sauce: collate_fn

When fine-tuning a chat model to behave like a specific tool, you don't want it to waste energy learning how to re-write your prompt. By using a custom data collator, I masked the system and user inputs (setting their labels to -100), forcing Gemma's loss calculation to focus strictly on generating the correct modern assistant response.
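The masking idea can be shown without any framework. The sketch below uses plain lists and a hypothetical prompt_length field marking where the assistant reply starts; the tutorial's actual collate_fn works on tensors and finds that boundary from the chat template.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids to labels, hiding the system and user tokens from the loss."""
    labels = list(input_ids)
    labels[:prompt_length] = [IGNORE_INDEX] * min(prompt_length, len(labels))
    return labels


def collate(examples, pad_token_id):
    """Pad a batch on the right and mask both prompt tokens and padding."""
    max_len = max(len(e["input_ids"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        ids = list(e["input_ids"])
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        batch["labels"].append(
            mask_prompt_labels(ids, e["prompt_length"]) + [IGNORE_INDEX] * pad
        )
    return batch


examples = [
    {"input_ids": [5, 6, 7, 8], "prompt_length": 2},
    {"input_ids": [9, 10, 11], "prompt_length": 1},
]
batch = collate(examples, pad_token_id=0)
print(batch["labels"])
```

Padding positions are masked as well, so the loss is driven only by tokens of the modern-Korean reply.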

After setting our hyper-parameters to gently cruise through 5 epochs with a learning rate of 2e-5, I hit train.

The Warm "After" Glow

After a bit of patience and letting the trainer do its magic, the results were incredibly rewarding. The character-by-character similarity score jumped all the way up to a brilliant 79.93%!

Look at how it handles the text now:

Original Classical Text: ᄇᆡᆨ씨듯고ᄂᆡ심의탄복왈그근본을ᄀᆞᆷ초지아니ᄒᆞ니장부로다ᄒᆞ고ᄌᆡ삼위로ᄒᆞ더라
Human Translation: 백씨 듣고 내심에 탄복 왈, "그 근본을 감추지 아니하니 장부로다!" 하고, 재삼 위로하더라.
Gemma's Fine-Tuned Translation: 백씨듯 고내심에 탄복 왈, "그 근본을 감초지 아니하니 장부로다." 하고 제삼 위로 하더라.
New Similarity Score: 85.71% ✨
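The post does not say how the similarity score is computed. One plausible character-level definition, shown here purely as an assumption, is the matching ratio from Python's difflib; it will not necessarily reproduce the exact percentages above.

```python
from difflib import SequenceMatcher


def char_similarity(candidate: str, reference: str) -> float:
    """Character-level similarity as a percentage (0 to 100)."""
    return 100.0 * SequenceMatcher(None, candidate, reference).ratio()


human = '백씨 듣고 내심에 탄복 왈, "그 근본을 감추지 아니하니 장부로다!" 하고, 재삼 위로하더라.'
tuned = '백씨듯 고내심에 탄복 왈, "그 근본을 감초지 아니하니 장부로다." 하고 제삼 위로 하더라.'
print(f"{char_similarity(tuned, human):.2f}%")
```

A character-level metric suits this task because the remaining errors are mostly spacing and single-syllable spelling differences, which word-level metrics would punish much more harshly.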

Closing Thoughts

Technology often pushes us relentlessly into the future, but my favorite tech projects are the ones that allow us to look backward with greater clarity. By spending a little time fine-tuning a lightweight model like Gemma 4, we can build tools that preserve cultural history, making ancient wisdom and classic stories accessible to anyone with a laptop.

Next time you find a piece of history that feels just a bit too out of reach, remember that a small dataset and a fine-tuning session might be all you need to bring it into the light.

Here is a structured workflow for fine-tuning on your own domain:

Define a clear goal
Prepare a high-quality dataset and evaluation plan
Verify the model is learning
Evaluate with metrics and human judgment
Deploy and iterate

👉 Check out this tutorial in Gemma Cookbook
👉 Star the repository to support us

LoRA and QLoRA fine-tuning: what they actually do under the hood

Tech_Nuggets — Tue, 09 Jun 2026 16:52:04 +0000

You spent three weeks curating a dataset of legal contract summaries: 12,000 pairs of dense legalese and plain-English counterparts. The model you picked -- a 7B parameter instruction-tuned Llama -- understands your prompts but produces summaries that read like a junior associate who memorized Blackstone but never saw a real merger clause. You reach for full fine-tuning, the obvious move. Then torch.cuda.OutOfMemoryError hits at step 20 on your RTX 4090. You try gradient checkpointing. You try a smaller batch. You try half-precision. Still OOM. Your colleague says "just use LoRA" and walks off, as if that explains anything.

This is the gap this post fills. You do not need another high-level "LoRA is a PEFT method" post. You need the math and the trade-offs that let you decide between LoRA, QLoRA, and full fine-tuning for your specific hardware and quality requirements.

Why parameter-efficient fine-tuning exists

The cost of full fine-tuning is straightforward: a model with P parameters requires storing, at minimum, the model weights (2P bytes for fp16), the optimizer states (8P bytes for Adam), and the gradients (2P bytes). For Llama 3 8B with fp16 parameters, that is roughly 16 GB for weights plus 64 GB for optimizer state plus 16 GB for gradients -- 96 GB total, before any activations. An RTX 4090 has 24 GB. Even a single 80 GB A100 falls short of that without offloading or a more memory-efficient optimizer.
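The arithmetic is worth having as a function you can point at other model sizes. This sketch uses the byte counts from the paragraph above and ignores activations, which only add to the total.

```python
def full_finetune_memory_gb(num_params, weight_bytes=2, grad_bytes=2, adam_bytes=8):
    """Rough memory floor for full fine-tuning, excluding activations."""
    gb = 1e9
    weights = num_params * weight_bytes / gb
    gradients = num_params * grad_bytes / gb
    optimizer = num_params * adam_bytes / gb
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


budget = full_finetune_memory_gb(8e9)
for name, value in budget.items():
    print(f"{name:>10}: {value:6.1f} GB")
print("fits on a 24 GB card:", budget["total"] <= 24)
```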

Parameter-efficient fine-tuning (PEFT) avoids this by keeping the vast majority of the model frozen and training only a tiny set of added parameters. The key insight is that the weight update during fine-tuning, delta W, has low intrinsic rank -- you can approximate it as a product of two much smaller matrices.

LoRA: low-rank adaptation

The LoRA paper (Hu et al., 2021, arXiv 2106.09685) proposed freezing the pretrained weight matrix W in R^(d x d) and learning a low-rank decomposition:

W' = W + BA

where B in R^(d x r), A in R^(r x d), and r << d (typically r = 8 or r = 16). Instead of updating d^2 parameters per layer, you update 2dr. For d = 4096 (a common hidden dimension) and r = 8, that is 65,536 parameters per layer instead of 16,777,216 -- a reduction of roughly 256x.
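The parameter count is easy to check for yourself. The sketch below reproduces the d = 4096, r = 8 figure and shows how the ratio moves with rank; the other rank values are just examples.

```python
def dense_params(d):
    """Parameters in a full d x d weight matrix."""
    return d * d


def lora_trainable_params(d, r):
    """Parameters in B (d x r) plus A (r x d)."""
    return 2 * d * r


d = 4096
for r in (1, 8, 16, 64):
    trainable = lora_trainable_params(d, r)
    ratio = dense_params(d) / trainable
    print(f"r={r:>2}: {trainable:>9,} trainable vs {dense_params(d):,} dense ({ratio:.0f}x fewer)")
```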

During the forward pass, the computation becomes:

h = xW' = xW + xBA

The first term uses frozen weights (no gradient needed). The second term is the adapter path. Only A and B receive gradient updates. The original W stays intact, which means you can swap adapters in and out at inference time. Merging the adapter into W (W' = W + BA) gives zero added latency; keeping it separate and computing h = xW + xBA on the fly costs one small extra low-rank multiply per adapted matrix.
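A small NumPy check makes the equivalence between the adapter path and the merged weights concrete. The dimensions are tiny and the values random, purely for illustration; real LoRA also initialises one of the two factors to zero and applies an alpha / r scale, both omitted here to match the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                       # toy sizes; real layers use d in the thousands

W = rng.standard_normal((d, d))   # frozen pretrained weight
B = rng.standard_normal((d, r))   # trainable low-rank factor
A = rng.standard_normal((r, d))   # trainable low-rank factor
x = rng.standard_normal((1, d))   # one input row vector

# Adapter path: the frozen term plus a cheap rank-r detour.
h_adapter = x @ W + (x @ B) @ A

# Merged path: fold the update into the weight once, then use a single matmul.
W_merged = W + B @ A
h_merged = x @ W_merged

print(np.allclose(h_adapter, h_merged))
```

Computing (x @ B) @ A rather than x @ (B @ A) keeps the intermediate at width r, which is why the unmerged adapter path stays cheap.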

Here is what the architecture looks like for a single Transformer attention layer:

flowchart LR
    subgraph Forward pass
        X[Input x] --> W[W frozen<br/>d x d]
        X --> B_adapt[B d x r]
        B_adapt --> A_adapt[A r x d]
        W --> ADD[Add]
        A_adapt --> ADD
        ADD --> OUT[Output h]
    end

    subgraph Gradient flow
        OUT --> GRAD_B[Gradients flow<br/>to B and A only]
        GRAD_B --> NO[No gradient<br/>through W]
    end

By default, LoRA is applied to the query and value projection matrices in each attention layer. You can also extend it to key, output, and the feed-forward layers. Empirically, setting r = 8 on Q and V covers most of the benefit; raising r beyond 16 rarely improves results by more than a trivial margin.

QLoRA: adding 4-bit quantization

QLoRA (Dettmers et al., 2023, arXiv 2305.14314) asked: what if instead of storing W in fp16, we stored it in 4 bits and still trained adapters on top? The result is a method that can fine-tune a 65B model on a single 48 GB GPU -- something that was previously impossible.

QLoRA makes three specific contributions that work together:

NF4 data type. NormalFloat4 is a quantization scheme designed for normally distributed weights. It maps the 4-bit values to the quantiles of a normal distribution, so the discretization error is minimized exactly where most weight values fall. Informally, NF4 allocates more of its 16 representable values around zero and fewer in the tails.

Double quantization. The quantization constants (one fp32 scale per block of weights) themselves take space: about 0.5 bits per parameter at a block size of 64. QLoRA quantizes these constants to 8 bits, cutting that overhead to roughly 0.13 bits per parameter, a saving of about 0.37 bits. The total is ~4.13 bits per parameter for the base model -- a little over 3.5 GB for a 7B model instead of 14 GB.

Paged optimizers. When GPU memory runs out during a long training run, the optimizer states are paged to CPU RAM and fetched back as needed. This prevents the OOM crash but can slow training; it is a safety net, not a performance feature.

During training, QLoRA dequantizes the 4-bit weights on the fly for each forward pass, computes the LoRA adapter contribution, and backpropagates only through the low-rank matrices. As in plain LoRA, the frozen base weights get no gradients or optimizer state; the additional saving comes from storing those weights in 4 bits instead of 16.
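The quantile idea behind NF4 can be sketched with the standard library. This is a simplified illustration, not the real format: the actual NF4 table is constructed asymmetrically so that zero is represented exactly, and the production kernels live in bitsandbytes. The function names here are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(num_levels=16):
    """Place levels at evenly spaced quantiles of N(0, 1), scaled into [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / num_levels) for i in range(num_levels)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


LEVELS = normal_quantile_levels()


def quantize_block(weights):
    """Absmax-scale a block, then store the index of the nearest level (4 bits each)."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Recover approximate weights from level indices and the block scale."""
    return [LEVELS[c] * absmax for c in codes]


block = [0.5, -0.1, 0.02, -0.5]
codes, absmax = quantize_block(block)
print(codes, absmax)
print([round(w, 3) for w in dequantize_block(codes, absmax)])
```

The gap between neighbouring levels is smallest near zero and widest in the tails, so small weights, which are the most common, are reconstructed with the least error.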

Full comparison

Dimension	Full fine-tuning	LoRA (fp16)	QLoRA (4-bit base + LoRA)
Base model memory	16 GB (7B, fp16)	16 GB (frozen)	~3.5 GB (NF4)
Adapter memory	0	2 GB (r=8, all layers)	2 GB
Optimizer state	~32 GB (Adam)	~4 GB (only adapters)	~4 GB
Total VRAM needed	~56 GB	~22 GB	~9.5 GB
Quality vs full FT	Baseline	On par or within 0.5%	Within 1-2% on most benchmarks
Multi-task support	One copy per task	One base + N adapters	One base + N adapters
Training speed (7B, A100)	1.0x baseline	~1.4x (faster)	~0.8x (slower, dequant overhead)

The speed trade-off is worth calling out explicitly: QLoRA trains slower than LoRA because every forward pass must dequantize the base weights. On a 7B model with a single A100, LoRA is roughly 1.4x faster than full fine-tuning (less data movement), while QLoRA is about 0.8x the speed of full fine-tuning (dequantization overhead). The memory savings are enormous though, which is why QLoRA dominates the conversation for consumer-grade GPUs.

Common pitfalls

Rank selection is not magic. Setting r = 256 everywhere will not automatically improve results. Higher rank means more trainable parameters but also more noise in the gradient signal. The original LoRA paper found that a rank of 1 already captures meaningful adaptation for many tasks. Start with r = 8 on Q and V, evaluate, and only increase rank on layers that underfit.

Adapter merge at scale. You can merge LoRA weights into W at inference time by computing W' = W + BA for each layer and discarding A and B. This eliminates the adapter inference overhead. But if you have 50 adapters for 50 different clients, you now need 50 copies of the full weights -- trading compute for storage. The right design depends on which resource you have more of.

QLoRA is not free. The NF4 dequantization adds numerical noise. On most tasks the quality loss is within the noise floor (1-2% on MMLU, roughly 0.5% on domain-specific benchmarks). But if you are tuning a model for a precision-critical task such as medical diagnosis or code correctness verification, the trade-off may swing back to full-precision LoRA or full fine-tuning.

Bitsandbytes versions matter. QLoRA depends on the bitsandbytes library for its CUDA quantization kernels. As of June 2026, bitsandbytes is at v0.49.2 and PEFT is at v0.19.1. The API changed between v0.43 and v0.44 -- if you are using an older PEFT, pin to a compatible bitsandbytes version. A version mismatch silently falls back to CPU quantization, which runs orders of magnitude slower.

Scaling the LoRA alpha. The LoRA scaling factor alpha / r controls the magnitude of the adapter update. A common mistake is setting alpha too low (adapter contribution vanishes) or too high (training destabilizes). A common community heuristic is alpha = 2r as a starting point; the original paper simply fixes alpha and does not tune it. Double-check this if your loss curve looks flat after 200 steps.
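With the scale included, the adapter path becomes h = xW + (alpha / r) xBA. A quick sketch of how the factor behaves for a few example settings (the specific values are illustrative only):

```python
def lora_scaling(alpha, r):
    """Multiplier applied to the low-rank update xBA."""
    return alpha / r


for alpha, r in [(16, 8), (16, 16), (16, 64), (128, 64)]:
    print(f"alpha={alpha:>3}, r={r:>2} -> scale {lora_scaling(alpha, r):.2f}")
```

The practical consequence: if you raise r while leaving alpha fixed, the adapter's effective contribution shrinks, so a rank sweep that does not also adjust alpha is not comparing like with like.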

When NOT to use it

LoRA and QLoRA are the wrong choice when:

You need to change the model's internal representations fundamentally. If you are adding new knowledge that the base model does not have (a new language, a new domain with very different token statistics), low-rank updates may not have enough capacity. Continued pretraining or full fine-tuning will capture the distribution shift more effectively.

Inference latency is your binding constraint and you serve from CPU. LoRA merges into the weights easily on GPU, but on CPU with on-the-fly adapter computation, the extra matrix multiply for BA adds latency. You can merge ahead of time, but then every adapter becomes a separate weight file.

You are fine-tuning a model smaller than 1B parameters. The memory savings of PEFT are less dramatic on small models. A 350M-parameter model consumes roughly 0.7 GB in fp16 (1.4 GB in fp32) -- the adapter overhead of LoRA starts to be a significant fraction of total parameters. A simple full fine-tuning pass may fit with gradient checkpointing and a reasonable batch size.

You need deterministic training across hardware. The quantization paths in QLoRA introduce non-determinism from the dequantization kernel. If you need perfectly reproducible training runs (for auditing or compliance), stick with full-precision LoRA or full fine-tuning with a fixed seed and deterministic CUDA backend.

TL;DR

LoRA approximates the fine-tuning weight update as a product of two low-rank matrices (B in d x r, A in r x d), reducing trainable parameters by 100x-1000x per layer with minimal quality loss.
QLoRA quantizes the frozen base model to 4-bit NF4, then trains LoRA adapters on top. A 65B model fits on a single 48 GB GPU.
The practical memory equation for a 7B model: full fine-tuning ~56 GB, LoRA ~22 GB, QLoRA ~9.5 GB.
Start with r = 8 on Q and V projection layers. Increase rank only if you see clear underfitting on your validation set.
QLoRA trains slower than LoRA (dequantization overhead) but uses roughly half the memory. Pick based on whether you are GPU-bound or time-bound.
Keep bitsandbytes and PEFT versions in sync. A version mismatch causes silent CPU fallback and catastrophic slowdown.
Do not use LoRA/QLoRA for small models (under 1B), for injecting fundamentally new knowledge, or for CPU-latency-sensitive serving where merge-ahead is impractical.
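The per-layer reduction quoted above can be checked by counting parameters. This sketch assumes a square d x d projection with d = 4096 (an illustrative hidden size, typical of 7B-class models); the function name lora_param_counts is hypothetical.

```python
def lora_param_counts(d: int, r: int):
    """Trainable parameters for one d x d projection: full update vs LoRA."""
    full = d * d       # every entry of the weight update is trainable
    lora = 2 * d * r   # B is d x r, A is r x d
    return full, lora, full // lora


for r in (4, 8, 16):
    full, lora, ratio = lora_param_counts(4096, r)
    print(f"r={r:2d}  full={full:,}  lora={lora:,}  reduction={ratio}x")
# r= 4  full=16,777,216  lora=32,768  reduction=512x
# r= 8  full=16,777,216  lora=65,536  reduction=256x
# r=16  full=16,777,216  lora=131,072  reduction=128x
```

The reduction is d / (2r), so it grows with the hidden size and shrinks as the rank goes up.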

We covered how to adapt an existing model efficiently. The next step is knowing when that adaptation has actually worked -- and that means evaluation. Next post: building a reliable evaluation pipeline that catches regressions before they ship, with or without a labeled test set.

If you are deciding between LoRA and QLoRA for a project right now, the key variable is your GPU budget. 24 GB or less? QLoRA. 48 GB or more? LoRA with a larger rank or full fine-tuning with LoRA on the side for rapid iteration. The code to make either choice work is a single pip install away.

Should You Fine-Tune? The 2026 Decision Ladder (Prompt RAG LoRA Distill)

AI Tech Connect — Sun, 07 Jun 2026 11:30:15 +0000

Originally published on AI Tech Connect.

The one question that saves you a GPU bill

Somewhere in the lifecycle of almost every AI feature, a team asks: "should we fine-tune?" The honest answer, the overwhelming majority of the time, is "not yet". It is a question that arrives too early, usually because fine-tuning sounds like the serious, grown-up move: the thing real machine-learning teams do, the lever that turns a generic model into your model. So a Bengaluru fintech building a transaction-narration feature, or a Manchester health-tech drafting clinical letters, spins up a training pipeline before it has wrung the value out of the cheaper rungs below.

The cost of asking too early is concrete. Fine-tuning is not free even when the GPU is cheap: you take on a data-curation effort, a training and evaluation loop, a model you…
