Alain Airom

Run Big LLMs on Small GPUs: A Hands-On Guide to 4-bit Quantization and QLoRA

Save the planet and adapt the LLM to your use-case!

Introduction

Quantizing a Large Language Model (LLM) to 4-bit precision (FP4, or the NF4 variant used in this guide) drastically reduces the memory it needs and can speed up inference (text generation), allowing larger models to run on less powerful hardware, such as a single desktop GPU. In practice, this is typically done with specialized libraries such as bitsandbytes, which integrate directly with PyTorch and the Hugging Face Transformers library.

This technique also supports a “greener”, more sustainable approach to AI by reducing the energy consumed during the model’s inference phase. Because 4-bit quantization shrinks the model’s weights by up to 75% compared to 16-bit precision, the GPU has to read roughly a quarter as much data from its memory (VRAM) for every generated token. Token generation is usually limited by memory bandwidth rather than by arithmetic, so moving less data can mean lower latency and a lower power draw per token. The number of arithmetic operations does not shrink, because bitsandbytes de-quantizes the weights to 16-bit before multiplying. The main gain is that large, powerful models fit on smaller, less energy-intensive hardware, which reduces the overall carbon footprint of the AI solution.
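To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes single-stream decoding is memory-bound, so each generated token requires streaming all weights from VRAM once. The 1,000 GB/s bandwidth figure and the function name are illustrative assumptions, not measurements.

```python
def min_seconds_per_token(n_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Lower bound on decode latency if each token must stream all weights once."""
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes / (bandwidth_gb_s * 1e9)

N_PARAMS = 7e9        # 7B-parameter model
BANDWIDTH = 1000.0    # hypothetical GPU memory bandwidth in GB/s (assumption)

t16 = min_seconds_per_token(N_PARAMS, 16, BANDWIDTH)
t4 = min_seconds_per_token(N_PARAMS, 4, BANDWIDTH)
print(f"16-bit: {t16 * 1e3:.1f} ms/token, 4-bit: {t4 * 1e3:.1f} ms/token")
```

This is only a lower bound on latency. Real speedups are usually smaller than 4x because the 4-bit weights must be de-quantized before each multiplication.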

The trade-off is precision. Reducing an LLM to lower precision (such as 4-bit) improves sustainability by cutting hardware and energy requirements, but the lost precision can lead to unexpected or substandard model outputs (i.e., “going sideways”). Adopting low-precision quantization therefore hinges on a cost-benefit analysis: if the potential impact on business or usage quality is acceptable given the gains in efficiency and environmental friendliness, it is a viable and sustainable way to deploy Generative AI.

🛠️ The Practical Method: 4-bit Quantization (NF4/FP4) — Concepts and Implementation (GPU required)

As mentioned above, the most common and accessible way to apply 4-bit quantization to an LLM is to use the bitsandbytes library together with Hugging Face Transformers and PyTorch. This library provides highly optimized CUDA kernels for low-bit computation.

  • Installing the pre-requisites
pip install transformers accelerate bitsandbytes torch
  • Configuration and required imports
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time

# --- Model configuration (in theory, any model works) ---
model_id = "meta-llama/Llama-2-7b-chat-hf"
  • Defining the 4-bit Quantization Configuration: using the BitsAndBytesConfig to tell the transformers library how to load the model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_compute_dtype=torch.bfloat16, 
)
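As a rough illustration of what `bnb_4bit_use_double_quant=True` buys, the sketch below estimates the extra bits per weight spent on per-block scaling constants. The block sizes (64 weights per block, 256 scales per second-level block) and the 8-bit/32-bit storage sizes are the figures reported in the QLoRA paper, taken here as assumptions. The function name is hypothetical.

```python
def scale_overhead_bits(block_size: int = 64, double_quant: bool = False,
                        second_block_size: int = 256) -> float:
    """Extra bits per weight spent on storing per-block scaling constants."""
    if not double_quant:
        return 32 / block_size                      # one FP32 scale per block
    # Scales are themselves quantized to 8 bits, with one FP32 constant
    # per group of `second_block_size` scales.
    return 8 / block_size + 32 / (block_size * second_block_size)

plain = scale_overhead_bits()
double = scale_overhead_bits(double_quant=True)
print(f"overhead without double quant: {plain:.3f} bits/weight")
print(f"overhead with double quant:    {double:.3f} bits/weight")
print(f"7B model at 4 + {double:.3f} bits: {7e9 * (4 + double) / 8 / 1e9:.2f} GB")
```

Under these assumptions the saving is a few hundred megabytes on a 7B model, which is small next to the 4-bit weights themselves but can matter on a GPU that is close to full.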
  • Loading the Quantized Model: when from_pretrained is called with the quantization_config, the large Linear layers of the LLM are automatically replaced with the highly optimized 4-bit counterparts from bitsandbytes
# Loading Tokenizer and Quantized model 
print(f"Loading model {model_id} in 4-bit...")
start_time = time.time()

# 'device_map="auto"' uses the 'accelerate' library to intelligently distribute layers onto the GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto", # Move model to GPU(s)
    torch_dtype=torch.bfloat16, # Default dtype for non-quantized layers
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

load_time = time.time() - start_time
print(f"4-bit model loaded in {load_time:.2f} seconds.")
print(f"Model size in VRAM is now approximately {model.get_memory_footprint() / (1024**3):.2f} GB (original was ~14 GB).")
  • Generating Text (Inference): prepare the prompt and generate the response.
prompt = "Explain the principle of 4-bit quantization in Large Language Models in one simple sentence."

# Llama models chat template format
messages = [
    {"role": "system", "content": "You are a concise and expert AI assistant in machine learning."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # Append the assistant turn marker when the template defines one
    return_tensors="pt"
).to(model.device)

print("\n--- Starting Generation ---")
generation_start_time = time.time()

output = model.generate(
    input_ids,
    max_new_tokens=150,           # Limit response length
    do_sample=True,               # Enable sampling (more creative)
    temperature=0.7,              # Control randomness
    top_k=50,                     # Filter tokens
    pad_token_id=tokenizer.eos_token_id # Important for padding
)

generation_time = time.time() - generation_start_time
print(f"Generation completed in {generation_time:.2f} seconds.")

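# Decode only the newly generated tokens (everything after the prompt).
# Splitting the full decoded text on the prompt would leave template markers such as "[/INST]" in the answer.
new_tokens = output[0][input_ids.shape[-1]:]
response_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()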

print("\n--- 4-bit Model Response ---")
print(response_text)
print("---------------------------------")
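For readers wondering what `temperature` and `top_k` do to the next-token distribution, here is a simplified, library-free sketch. It is not the Transformers implementation. The toy logits and the function name are made up for illustration.

```python
import math
import random

def top_k_temperature_probs(logits, temperature=0.7, top_k=50):
    """Return sampling probabilities after temperature scaling and top-k filtering."""
    # Keep only the indices of the top_k largest logits.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = {i: logits[i] / temperature for i in keep}
    peak = max(scaled.values())                      # subtract the max for numerical stability
    exps = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]             # made-up scores for a 5-token vocabulary
probs = top_k_temperature_probs(toy_logits, temperature=0.7, top_k=3)
next_token = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, next_token)
```

A temperature below 1 sharpens the distribution toward the highest-scoring tokens, while `top_k` removes every token outside the k best candidates before sampling.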
  • The complete, all-in-one script follows 👨‍💻
# qlora_4bit_inference.py
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def main():
    """
    Main function to configure, load, and run inference on the 4-bit quantized LLM.
    """

    model_id = "meta-llama/Llama-2-7b-chat-hf"

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
        bnb_4bit_use_double_quant=True,         # Also quantize the per-block scaling constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # Dtype used for the matrix multiplications
    )

    print("-" * 50)
    print(f"Attempting to load model '{model_id}' in 4-bit...")
    print("-" * 50)

    start_time = time.time()

    try:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config=bnb_config,
            device_map="auto",           # Move model to GPU(s)
            torch_dtype=torch.bfloat16,  # Default dtype for non-quantized layers
            trust_remote_code=True,      # Required for some models
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id)

        load_time = time.time() - start_time
        mem_footprint = model.get_memory_footprint() / (1024**3)

        print(f"\n✅ 4-bit model loaded successfully in {load_time:.2f} seconds.")
        print(f"Memory Footprint: Approximately {mem_footprint:.2f} GB (Original 16-bit model is ~14 GB).")
        print("-" * 50)

    except Exception as e:
        print("\n❌ ERROR: Failed to load model. This usually means missing dependencies or no CUDA-enabled GPU.")
        print(f"Error details: {e}")
        return  # Exit on error

    prompt = "Explain the principle of 4-bit quantization in Large Language Models in one simple sentence."

    messages = [
        {"role": "system", "content": "You are a concise and expert AI assistant in machine learning. Your response must be extremely brief."},
        {"role": "user", "content": prompt}
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # Append the assistant turn marker when the template defines one
        return_tensors="pt"
    ).to(model.device)

    print("\n--- Starting Text Generation (Inference) ---")
    generation_start_time = time.time()

    output = model.generate(
        input_ids,
        max_new_tokens=150,           # Limit response length
        do_sample=True,               # Enable sampling (more creative)
        temperature=0.7,              # Control randomness
        top_k=50,                     # Filter tokens
        pad_token_id=tokenizer.eos_token_id # Important for padding
    )

    generation_time = time.time() - generation_start_time
    print(f"Generation completed in {generation_time:.2f} seconds.")

    # Decode only the newly generated tokens (everything after the prompt).
    # Searching the full text for "assistant" would match the system message instead of the answer.
    new_tokens = output[0][input_ids.shape[-1]:]
    response_text = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    print("\n--- 4-bit Model Response ---")
    print(response_text)
    print("-" * 50)

if __name__ == "__main__":
    main()

🧠 Why 4-bit Works and Requires a GPU

The Principle of Quantization

Quantization reduces the number of bits used to represent the model’s weights (parameters).

  • A standard weight might be stored as BF16 (16 bits, 2 bytes).
  • A 4-bit weight is stored using only 4 bits (0.5 bytes).
  • For a 7 Billion parameter model, this reduces the required VRAM from ≈14 GB to ≈3.5 GB.
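The arithmetic behind these figures is simple enough to check directly. The helper below is a minimal sketch (the function name is hypothetical) that counts only the weights and ignores activations, the KV cache, and quantization constants.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for label, bits in [("BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Actual VRAM usage will be somewhat higher than these numbers, since some layers stay in 16-bit and inference needs working memory on top of the weights.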

Maintaining Accuracy

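Since 4 bits can only represent 16 distinct values (2⁴), the original high-precision values are mapped onto these 16 levels using a scaling factor that is calculated per block of weights (in bitsandbytes, the block's absolute maximum) to minimize information loss. With NF4, the 16 levels are spaced to suit the roughly normal distribution of trained weights.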
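The sketch below illustrates block-wise 4-bit quantization on a toy block of weights. It is a simplification under stated assumptions: it uses 16 evenly spaced levels and one absolute-maximum scale per block, whereas NF4 uses a non-uniform set of levels. All names are hypothetical and this is not the bitsandbytes implementation.

```python
LEVELS = [-1 + 2 * i / 15 for i in range(16)]       # 16 evenly spaced levels in [-1, 1]

def quantize_block(block):
    """Map floats to 4-bit codes (0..15) plus one absmax scale for the block."""
    scale = max(abs(v) for v in block) or 1.0       # avoid division by zero
    codes = [min(range(16), key=lambda c: abs(v / scale - LEVELS[c])) for v in block]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the codes and the block scale."""
    return [LEVELS[c] * scale for c in codes]

weights = [0.12, -0.40, 0.03, 0.27, -0.09, 0.31, -0.22, 0.05]   # toy block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

With evenly spaced levels the worst-case error per weight is half a step, that is `scale / 15`. Using one scale per small block keeps a single outlier from inflating the error for the whole tensor.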

The Need for a GPU

Even after reducing the memory footprint, the core operation of an LLM — massive parallel matrix multiplications — must still be performed quickly.

  • The GPU (Graphics Processing Unit) is essential because it has thousands of specialized cores (like CUDA Cores or Tensor Cores) designed to execute these parallel calculations rapidly.
  • The bitsandbytes library relies on CUDA kernels to efficiently handle the de-quantization (converting the 4-bit weights back to 16-bit) and the subsequent high-speed matrix multiplication on the GPU. Without a GPU, even a 4-bit model would run extremely slowly on a CPU, making real-time inference impractical.

Fine-tuning a 4-bit quantized model with the QLoRA (Quantized Low-Rank Adaptation) technique is one of the most widely used methods for adapting massive LLMs efficiently on consumer hardware.

🚀 Understanding QLoRA (Quantized Low-Rank Adaptation)

QLoRA allows you to update only a tiny fraction of the model’s parameters while keeping the vast majority of the weights stored and loaded in memory at 4-bit, saving enormous amounts of VRAM.

QLoRA is an extension of the LoRA (Low-Rank Adaptation) technique, designed specifically to work with models loaded in 4-bit precision using bitsandbytes.

The Core Idea: Adapter Layers

Instead of fine-tuning all N parameters of the large pre-trained model (W_4bit), QLoRA introduces small, trainable, low-rank adapter matrices (A and B) alongside the pre-trained weights.

  • The large pre-trained weights (W_4bit) are kept frozen and remain in 4-bit (NF4) memory.
  • Only the small adapter matrices (A and B) are trainable and stored in high precision (BF16/FP16).

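The Forward Pass Calculation

During the forward pass, the small update ΔW = A×B is calculated and added to the frozen weights:

W_eff = dequantize(W_4bit) + ΔW,  y = x · W_eff

This is done efficiently by:

  1. De-quantizing W_4bit (or parts of it) to BF16 for the calculation.
  2. Adding the small, high-precision update ΔW (which is A×B).
  3. Performing the final matrix multiplication.

In practice, libraries usually compute x · W and (x · A) · B separately and sum the results, which is mathematically equivalent and avoids materializing the full ΔW matrix.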
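The equivalence between adding ΔW to the weights and applying the adapters as a separate branch can be checked on toy matrices. This is a minimal sketch in plain Python. It omits the LoRA scaling factor and dropout, and the small matrices are made-up values standing in for the de-quantized weights and the adapters.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy shapes: input x is 1x3, W is 3x3, adapters A (3x1) and B (1x3) have rank r = 1.
x = [[1.0, 2.0, 3.0]]
W = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]   # stands in for de-quantized W_4bit
A = [[0.1], [0.2], [0.3]]
B = [[1.0, 0.0, -1.0]]

merged = matmul(x, add(W, matmul(A, B)))                # x · (W + A×B)
separate = add(matmul(x, W), matmul(matmul(x, A), B))   # x·W + (x·A)·B
print(merged, separate)
```

The second form never builds the full 3x3 update. With a real 4096x4096 layer and a small rank, that makes the adapter branch much cheaper than forming ΔW.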

Key Memory Savings



| Component                                  | Storage Precision | Trainable?                 | Memory Impact                             |
| ------------------------------------------ | ----------------- | -------------------------- | ----------------------------------------- |
| **Base Model Weights (W_4bit)**            | **4-bit (NF4)**   | No (Frozen)                | **Massive Savings** (e.g., 3.5 GB for 7B) |
| **Adapter Weights (*A* and *B*)**          | **16-bit (BF16)** | Yes                        | **Minimal** (e.g., ≈ 50 MB - 150 MB)      |
| **Gradients/Optimiser States**             | 16/32-bit         | Yes (Only for *A* and *B*) | **Minimal** (Only for the tiny adapters)  |

By training only the adapters, QLoRA can reach performance comparable to full fine-tuning while using far less memory: no gradients or optimizer states are kept for the frozen base weights, which are themselves stored in 4-bit.
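To see how small the adapters are, the sketch below counts their parameters for Llama-2-7B, assuming rank 64 and adapters on the four attention projections (the settings used in this article's LoRA configuration). The hidden size of 4096, the 32 layers, and the roughly 6.74B base parameters are properties of that model. The function name is hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one adapter pair: A is d_in x r, B is r x d_out."""
    return r * (d_in + d_out)

HIDDEN = 4096        # Llama-2-7B hidden size
LAYERS = 32          # Llama-2-7B transformer blocks
MODULES = 4          # q_proj, k_proj, v_proj, o_proj (each 4096 x 4096 in the 7B model)
RANK = 64

adapter_params = LAYERS * MODULES * lora_param_count(HIDDEN, HIDDEN, RANK)
adapter_mb = adapter_params * 2 / 1e6                  # 2 bytes per BF16 value
share = adapter_params / 6.74e9                        # ~6.74B base parameters (approximate)
print(f"{adapter_params:,} adapter parameters, {adapter_mb:.0f} MB in BF16, {share:.2%} of the base model")
```

With these settings the adapters come to roughly 1% of the base model's parameter count. A lower rank or fewer target modules shrinks that share further.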

🛠️ Practical Implementation Steps for QLoRA

The implementation closely mirrors the 4-bit loading process, with the addition of the PEFT (Parameter Efficient Fine-Tuning) library.

  • Additional Dependencies: we need peft for the LoRA configuration and functionality, trl for the SFTTrainer, and datasets for loading the training data.
pip install peft trl datasets
  • Conceptual code: the overall flow of QLoRA fine-tuning with transformers, peft, and trl:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from trl import SFTTrainer # Uses a specialized Trainer for chat/instruction data

model_id = "meta-llama/Llama-2-7b-chat-hf" 

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype=torch.bfloat16, 
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.config.use_cache = False 
tokenizer.pad_token = tokenizer.eos_token 

lora_config = LoraConfig(
    r=64, # The 'rank' (r) is the dimension of the adapter matrices (A and B). Higher r = more parameters.
    lora_alpha=16, # Scaling factor for the weights.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Which layers to apply LoRA to (typically attention layers).
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM", # Specify the task type for the model
)

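# Prepare the quantized model for training, then apply the LoRA adapters (QLoRA)
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
print("QLoRA Model Trainable Parameters:")
model.print_trainable_parameters() # Shows a small fraction of the total, roughly 1% for r=64 on a 7B model.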

# Load Data ---
dataset = load_dataset("json", data_files="my_instruction_data.json", split="train")

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora_results",
    num_train_epochs=1,
    per_device_train_batch_size=4, # Adjust based on your VRAM, higher is better
    gradient_accumulation_steps=4, # Simulate a larger batch size
    optim="paged_adamw_8bit", # Optimiser designed to handle large memory needs by paging tensors
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    fp16=False, # Use BF16 for training if available (set to True for older GPUs)
    bf16=True, # Recommended for modern NVIDIA GPUs
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
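    # peft_config is omitted: the model was already wrapped by get_peft_model() above.
    # Note: dataset_text_field, max_seq_length and tokenizer follow older trl releases;
    # newer releases expect the first two on SFTConfig, so check your installed version.
    dataset_text_field="text", # The column in your dataset containing the text
    max_seq_length=512,
    tokenizer=tokenizer,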
    args=training_args,
)

# Start training the tiny adapter layers
trainer.train()

# --- Save the Adapters ---
# After training, you only save the small, trained LoRA weights.
trainer.model.save_pretrained("./final_qlora_adapters")
  • Mock data 🗄️
...
[
  {
    "text": "<s>[INST] <<SYS>>You are an expert in computer science terminology. Answer questions concisely.<</SYS>>Define QLoRA in simple terms.[/INST] QLoRA is an efficient fine-tuning technique that uses 4-bit quantization to reduce memory usage while adding small, trainable adapter layers (LoRA) to a large language model.</s>"
  },
  {
    "text": "<s>[INST] <<SYS>>You are an expert in computer science terminology. Answer questions concisely.<</SYS>>What is the main benefit of 4-bit quantization?[/INST] The main benefit is a dramatic reduction in the memory footprint (VRAM usage) required to load and run a massive model, making it accessible on consumer hardware.</s>"
  },
  {
    "text": "<s>[INST] <<SYS>>You are an expert in computer science terminology. Answer questions concisely.<</SYS>>Why do we use the paged_adamw_8bit optimizer?[/INST] The paged_adamw_8bit optimizer is specifically designed to manage memory more efficiently by 'paging' the optimizer states between CPU and GPU memory, preventing out-of-memory (OOM) errors during training.</s>"
  },
  {
    "text": "<s>[INST] <<SYS>>You are an expert in computer science terminology. Answer questions concisely.<</SYS>>What does the 'r' parameter in LoRA config control?[/INST] The 'r' parameter, or rank, defines the dimension (capacity) of the adapter matrices. A higher 'r' typically leads to better performance but slightly more memory usage.</s>"
  }
]
...
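Records in this layout can be generated and sanity-checked with a few lines of Python. The sketch below mirrors the single-turn template used in the mock data above. The helper names and the sample question and answer are hypothetical.

```python
import json

def format_example(system: str, user: str, answer: str) -> dict:
    """Build one training record in the single-turn layout used by the mock data."""
    text = f"<s>[INST] <<SYS>>{system}<</SYS>>{user}[/INST] {answer}</s>"
    return {"text": text}

def is_valid_record(record: dict) -> bool:
    """Check that a record has a 'text' field with the expected delimiters."""
    text = record.get("text", "")
    return text.startswith("<s>[INST]") and "[/INST]" in text and text.endswith("</s>")

SYSTEM = "You are an expert in computer science terminology. Answer questions concisely."
records = [
    format_example(SYSTEM, "Define LoRA in one sentence.",
                   "LoRA fine-tunes a model by training small low-rank adapter matrices."),
]
assert all(is_valid_record(r) for r in records)
print(json.dumps(records, indent=2))
```

Validating records before training catches missing delimiters early, which is cheaper than discovering them after a fine-tuning run.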
  • And the complete code 👨‍💻
# qlora_finetune.py
import torch
import time
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from trl import SFTTrainer 

def main():
    """
    Main function to configure, load, and run the QLoRA fine-tuning process.
    """

    # --- MODEL SETUP ---
    model_id = "meta-llama/Llama-2-7b-chat-hf" 
    output_dir = "./qlora_results"

    print("-" * 70)
    print("Starting QLoRA Fine-Tuning Setup...")
    print(f"Target Model: {model_id}")
    print(f"Output Directory: {output_dir}")
    print("-" * 70)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # High precision for calculations
    )

    start_time = time.time()
    try:
        print("Loading base model in 4-bit (This step requires a CUDA GPU)...")
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config=bnb_config,
            device_map="auto",
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id)

        model.config.use_cache = False              # The KV cache is not needed during training
        tokenizer.pad_token = tokenizer.eos_token   # Llama-2 ships without a dedicated pad token

        load_time = time.time() - start_time
        mem_footprint = model.get_memory_footprint() / (1024**3)
        print(f"Model loaded in {load_time:.2f} seconds.")
        print(f"Estimated VRAM usage: {mem_footprint:.2f} GB. ")
        print("-" * 70)

    except Exception as e:
        print("\n❌ ERROR: Failed to load model. Ensure you have a CUDA-enabled GPU and all libraries installed correctly.")
        print(f"Error details: {e}")
        return

    lora_config = LoraConfig(
        r=64,                      # Rank: Dimension of the adapter matrices (A and B). Determines capacity.
        lora_alpha=16,             # Scaling factor for the weights.
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Layers to inject adapters into
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
    )

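    # Prepare the quantized model for training (local import keeps this step self-contained)
    from peft import prepare_model_for_kbit_training
    model = prepare_model_for_kbit_training(model)

    model = get_peft_model(model, lora_config)
    print("QLoRA Model Setup Complete:")
    model.print_trainable_parameters()
    print("Only the adapter layers (roughly 1% of the total parameters with this configuration) will be trained.")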
    print("-" * 70)

    print("Loading instruction dataset...")
    if not os.path.exists("my_instruction_data.json"):
        print("\n❌ ERROR: 'my_instruction_data.json' not found. Please create it first (see setup_and_run_qlora.sh).")
        return

    dataset = load_dataset("json", data_files="my_instruction_data.json", split="train")
    print(f"Dataset loaded successfully with {len(dataset)} examples.")

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4, # Effectively batch size of 16
        optim="paged_adamw_8bit",    # Optimizer designed to be VRAM-efficient
        logging_steps=10,
        save_strategy="epoch",
        learning_rate=2e-4,
        fp16=False,
        bf16=True,                   # Use BF16 for training intermediate values (recommended)
    )

    print("\nStarting SFTTrainer setup...")
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
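        # peft_config is omitted: the model was already wrapped by get_peft_model() above.
        # Note: dataset_text_field, max_seq_length and tokenizer follow older trl releases;
        # newer releases expect the first two on SFTConfig, so check your installed version.
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,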
        args=training_args,
    )

    print("\n🚀 Beginning QLoRA Training...")
    trainer.train()
    print("✅ Training complete.")

    adapter_save_path = "./final_qlora_adapters"
    print(f"\nSaving final QLoRA adapters to: {adapter_save_path}")
    trainer.model.save_pretrained(adapter_save_path)
    print("✅ Adapters saved successfully. You can now merge these with the base model for deployment.")
    print("-" * 70)

if __name__ == "__main__":
    main()
  • After this process, you can load the original 4-bit model and load these small adapter weights on top of it for inference, customizing the model’s behavior without needing to save the entire 14 GB model again.
  • The process could be automated with this script 🐚
echo "Installing required Python libraries (this may take a few minutes)..."
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

pip install transformers accelerate bitsandbytes peft trl datasets sentencepiece

# Mock Data File 
echo "Creating the mock instruction data file: my_instruction_data.json"

# A quoted heredoc is used so the single quotes inside the data survive the shell unchanged
python - <<'PY'
import json
data = [
    {
        "text": "<s>[INST] <<SYS>>You are an expert in computer science terminology. Answer questions concisely.<</SYS>>Define QLoRA in simple terms.[/INST] QLoRA is an efficient fine-tuning technique that uses 4-bit quantization to reduce memory usage while adding small, trainable adapter layers (LoRA) to a large language model.</s>"
    },
    {
        "text": "<s>[INST] <<SYS>>You are an expert in computer science terminology. Answer questions concisely.<</SYS>>What is the main benefit of 4-bit quantization?[/INST] The main benefit is a dramatic reduction in the memory footprint (VRAM usage) required to load and run a massive model, making it accessible on consumer hardware.</s>"
    },
    {
        "text": "<s>[INST] <<SYS>>You are an expert in computer science terminology. Answer questions concisely.<</SYS>>What is the 'r' parameter in LoRA config?[/INST] The 'r' parameter, or rank, defines the dimension (capacity) of the adapter matrices. A higher 'r' typically leads to better performance but slightly more memory usage.</s>"
    }
]
with open("my_instruction_data.json", "w") as f:
    json.dump(data, f, indent=2)

print("my_instruction_data.json created.")
PY

echo "Executing the QLoRA Fine-Tuning script (qlora_finetune.py)..."
python qlora_finetune.py

echo "Execution finished. Check the './final_qlora_adapters' folder for the small, trained weights!"

Conclusion

The shift to 4-bit quantization and QLoRA is primarily about democratizing access to massive Large Language Models (LLMs) by tackling the fundamental problems of GPU memory scarcity and computational cost. Traditional full fine-tuning of a model like LLaMA 70B can require hundreds of gigabytes of VRAM, pricing all but the largest tech firms out of the specialized AI market. QLoRA (Quantized Low-Rank Adaptation) addresses this by freezing the bulk of the model in a highly compressed 4-bit state and training only tiny, high-precision Low-Rank Adapters (LoRA).

The benefit is substantial: companies and small research labs can fine-tune multi-billion-parameter models, with results close to full fine-tuning, on a single GPU with 48 GB of VRAM or less. This efficiency speeds up iteration, lowers infrastructure spend, and lets businesses build specialized, domain-specific AI solutions (e.g., legal reasoning or proprietary customer service bots) without access to a supercomputer. QLoRA is thus one of the core techniques enabling the current wave of accessible, customized AI development.
