DEV Community

TildAlice

Posted on • Originally published at tildalice.io

Paper Review: QLoRA — 4-Bit Finetuning That Runs 65B Models on One GPU

The Memory Wall Problem

Finetuning a 65-billion-parameter LLM with regular 16-bit training requires more than 780GB of GPU memory once you count gradients and optimizer states alongside the weights. Even with LoRA (which I covered in an earlier LoRA paper review, though that one is in Korean), you still need to load the full model weights in 16-bit, which means about 130GB just for a 65B model. That's more than any single consumer GPU can handle.
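The numbers above come from simple per-parameter arithmetic. A back-of-envelope sketch (illustrative only; real usage also depends on activations, batch size, and framework overhead, and the 12 bytes/param figure assumes mixed-precision Adam):

```python
# Back-of-envelope GPU memory for a 65B-parameter model.
PARAMS = 65e9

def weights_gb(bytes_per_param: float) -> float:
    """Memory in GB for PARAMS parameters at the given storage cost."""
    return PARAMS * bytes_per_param / 1e9

# 16-bit weights alone: 2 bytes/param -> ~130 GB
fp16_weights = weights_gb(2.0)

# 4-bit weights alone: 0.5 bytes/param -> ~32.5 GB
int4_weights = weights_gb(0.5)

# Full 16-bit finetuning also stores gradients (2 bytes/param) plus
# mixed-precision Adam state (~8 bytes/param), ~12 bytes/param total -> ~780 GB
full_finetune = weights_gb(12.0)

print(f"16-bit weights:       {fp16_weights:.0f} GB")   # 130 GB
print(f"4-bit weights:        {int4_weights:.1f} GB")   # 32.5 GB
print(f"16-bit full finetune: {full_finetune:.0f} GB")  # 780 GB
```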

QLoRA changed that. You can read the original paper by Dettmers et al. (2023).

The paper's headline result: finetuning a 65B parameter model on a single 48GB GPU with no performance degradation compared to full 16-bit finetuning. That's not a typo. Storing the frozen base weights in 4-bit shrinks them roughly 4x versus 16-bit, and because only the small LoRA adapters are trained, the gradients and optimizer states that dominate full finetuning all but disappear, bringing the total from over 780GB down under 48GB while matching 16-bit finetuning quality.
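The 4-bit storage behind this result is blockwise absmax quantization to a NormalFloat (NF4) code. A toy sketch of the blockwise absmax idea, simplified to uniform symmetric int4 levels rather than the NF4 code, and keeping per-block scales in full precision instead of double-quantizing them as the paper does (block size 64 matches the paper's setup):

```python
# Toy blockwise absmax quantization: each block of weights is scaled by
# its own absolute maximum, then rounded to 15 int4 levels in [-7, 7].

def quantize_block(block):
    """Return int4 levels plus the per-block absmax scale."""
    absmax = max(abs(x) for x in block) or 1.0
    q = [round(x / absmax * 7) for x in block]
    return q, absmax

def dequantize_block(q, absmax):
    return [v / 7 * absmax for v in q]

def quantize(weights, block_size=64):
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b) for b in blocks]

def dequantize(quantized):
    out = []
    for q, absmax in quantized:
        out.extend(dequantize_block(q, absmax))
    return out

# Round-trip a stand-in weight vector; error is bounded per block by
# absmax / 14 (half a quantization step).
weights = [0.02 * i - 0.5 for i in range(128)]
restored = dequantize(quantize(weights))
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_err:.4f}")
```

Blockwise scales are the key trick: one outlier weight only degrades the resolution of its own 64-value block instead of the whole tensor.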


Photo by Steve Johnson on Pexels

What QLoRA Actually Does


Continue reading the full article on TildAlice