DEV Community

Shrijith Venkatramana

The Power of Quantization: Shrinking GPT-2, Unleashing Speed

Imagine taking a powerful language model like GPT-2—capable of crafting stories, answering questions, and mimicking human text—and compressing it into a leaner, faster version without gutting its capabilities.

This is the promise of quantization: a technique that reduces the precision of a model’s calculations, trading marginal accuracy for dramatic efficiency gains.

Phase 0: The Technical Setup

    !pip install torch transformers accelerate bitsandbytes psutil

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    import torch
    import time
    import gc

    def get_memory_usage():
        return torch.cuda.memory_allocated() / 1e6 if torch.cuda.is_available() else 0


    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name = "gpt2"
    input_text = "Once upon a time"

Phase 1: The Baseline – Full Precision (FP32)

The experiment begins with GPT-2 in its natural state: 32-bit floating-point precision (FP32). This is the model’s “full power” mode—highly precise but resource-intensive.

  • Memory: Loading the FP32 model consumes 511 MB of GPU memory.
  • Speed: Generating from the prompt “Once upon a time” up to a total length of 50 tokens (max_length counts the prompt too) takes 1.76 seconds.
  • Post-Cleanup Footprint: Even after deleting the model, 458 MB of memory remains occupied.

FP32 works, but it’s bulky.
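As a rough sanity check on the 511 MB figure, weight memory can be estimated from the parameter count alone. The sketch below assumes about 124 million parameters for the base `gpt2` checkpoint (a commonly cited figure, not something measured in this experiment), and `estimate_weight_mb` is a hypothetical helper, not part of any library.

```python
def estimate_weight_mb(num_params: int, bits_per_param: int) -> float:
    """Estimate weight storage in MB (1 MB = 1e6 bytes, as in get_memory_usage)."""
    return num_params * bits_per_param / 8 / 1e6

# Assumption: ~124M parameters for the base "gpt2" checkpoint.
GPT2_PARAMS = 124_000_000

for label, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{estimate_weight_mb(GPT2_PARAMS, bits):.0f} MB")
```

The FP32 estimate of roughly 496 MB sits close to the 511 MB measured above; the remainder is plausibly non-parameter buffers. The 8- and 4-bit estimates should be read as lower bounds, since some tensors (embeddings, for instance) are typically left in higher precision.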

    # Load tokenizer and base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"Pre-load memory: {get_memory_usage()} MB")

    # Full precision model
    model_fp32 = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    print(f"Post-load memory: {get_memory_usage()} MB")  # 511.15 MB

    # Inference measurement
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    start_time = time.time()
    output = model_fp32.generate(**inputs, max_length=50)
    inference_time = time.time() - start_time  # 1.76s

    # Cleanup protocol
    del model_fp32, inputs
    gc.collect()
    torch.cuda.empty_cache()

Phase 2: Trimming the Fat – 8-bit Quantization (INT8)

Enter 8-bit quantization, where the model's weights are stored as 8-bit integers instead of 32-bit floats. The transformation is immediate:

  • Memory: The INT8 model loads with just 187 MB, 63% smaller than FP32.
  • Speed: Inference accelerates to 1.38 seconds, a 22% improvement.
  • Post-Cleanup Footprint: Memory drops to 139 MB after deletion.

The model is lighter, faster, and still functional. A clear upgrade.
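Single-run timings like the ones in this post can be noisy, and a first call often includes one-off warm-up cost. One way to get steadier numbers is to repeat the call and keep the best time. The sketch below is a minimal harness under that assumption; `time_call` and its `sync` parameter are hypothetical names, and the workload is a stand-in so the snippet runs without a GPU or model.

```python
import time

def time_call(fn, *args, sync=None, repeats=3, **kwargs):
    """Return the best wall-clock time (seconds) over `repeats` calls to fn.

    `sync` is an optional zero-argument callable run before each clock read;
    on a GPU one could pass torch.cuda.synchronize so queued work finishes.
    """
    best = float("inf")
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn(*args, **kwargs)
        if sync is not None:
            sync()
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload so the sketch runs anywhere.
elapsed = time_call(sum, range(100_000))
print(f"best of 3: {elapsed:.6f} s")
```

With the objects from this post, a call might look like `time_call(model_int8.generate, **inputs_int8, max_length=50, sync=torch.cuda.synchronize)`.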

    # 8-bit configuration
    quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

    print(f"Pre-load memory: {get_memory_usage()} MB")  # 9.18 MB
    model_int8 = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=quant_config_8bit
    )

    # Dynamic input handling
    inputs_int8 = tokenizer(input_text, return_tensors="pt").to(model_int8.device)
    start_time = time.time()
    output = model_int8.generate(**inputs_int8, max_length=50)
    inference_time = time.time() - start_time  # 1.38s

Phase 3: The Edge of Efficiency – 4-bit Quantization (INT4)

Now we push further. With 4-bit quantization, weights are compressed to near-minimal precision, and computations use 16-bit floats for stability.

  • Memory: The INT4 model weighs in at 149 MB, 71% lighter than FP32.
  • Speed: Inference time drops to 1.08 seconds, a 39% gain over FP32.
  • Post-Cleanup Footprint: Memory plummets to 58 MB—a fraction of the original.

This isn’t just optimization; it’s reinvention.

    # 4-bit configuration (settings assumed from the description above;
    # the exact options used in the original run are not shown)
    quant_config_4bit = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16  # 16-bit float compute
    )

    print(f"Pre-load memory: {get_memory_usage()} MB")
    model_int4 = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config_4bit
    )

    # Dynamic input handling
    inputs_int4 = tokenizer(input_text, return_tensors="pt").to(model_int4.device)
    start_time = time.time()
    output = model_int4.generate(**inputs_int4, max_length=50)
    inference_time = time.time() - start_time  # 1.08s

The Trade-offs: Precision vs. Practicality

Quantization isn’t free. Reducing precision can subtly degrade model accuracy, but for many tasks—like casual text generation—the difference is imperceptible. What we gain far outweighs the cost:

  • Memory Efficiency: FP32: 511 MB → INT8: 187 MB → INT4: 149 MB.

Result: Models fit into tighter memory constraints, enabling deployment on consumer GPUs or edge devices.

  • Inference Speed: FP32: 1.76s → INT8: 1.38s → INT4: 1.08s.

Result: Faster responses for real-time applications, from chatbots to automated content generation.
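The percentages quoted in this post follow directly from these figures. A small sketch of the arithmetic, where `reduction_pct` is a hypothetical helper name:

```python
# Figures reported in this post (MB and seconds).
memory_mb = {"FP32": 511, "INT8": 187, "INT4": 149}
latency_s = {"FP32": 1.76, "INT8": 1.38, "INT4": 1.08}

def reduction_pct(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to `baseline`."""
    return (1 - value / baseline) * 100

for name in ("INT8", "INT4"):
    mem = reduction_pct(memory_mb["FP32"], memory_mb[name])
    lat = reduction_pct(latency_s["FP32"], latency_s[name])
    print(f"{name}: {mem:.0f}% less memory, {lat:.0f}% less time")
# INT8: 63% less memory, 22% less time
# INT4: 71% less memory, 39% less time
```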


How It Works: The Mechanics of Compression

At its core, quantization maps high-precision values (like 32-bit floats) to lower-precision formats (8- or 4-bit integers). For example:

  • FP32 uses 32 bits per number, capturing fine details but demanding heavy resources.
  • INT8/INT4 use fewer bits, approximating values with minimal loss.

The bitsandbytes library handles this automatically, repacking weights and adjusting computations to maintain stability.
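To make the mapping concrete, here is a minimal sketch of absmax quantization, one of the simplest schemes: scale values so the largest magnitude lands on the largest representable integer, round, and multiply back to recover an approximation. This is an illustration only and not how bitsandbytes is implemented; real libraries use more elaborate variants. The function names and sample weights are made up for the example.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers using a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.05, 0.9]  # made-up sample values
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: {q}, max error = {max_err:.4f}")
```

Running it shows the trade-off directly: the 4-bit version has only 15 levels to work with, so its reconstruction error is noticeably larger than the 8-bit one.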


The Visual Proof

A side-by-side comparison seals the argument:

  • Memory Usage (Bar Chart): FP32 towers over INT8 and INT4, showcasing the stark reduction in resource demands.
  • Inference Time (Line Plot): The downward slope from FP32 to INT4 highlights the speed gains.

The takeaway? Quantization isn’t just a technical footnote—it’s a practical tool for democratizing AI.

    # Visualization setup
    import matplotlib.pyplot as plt
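    quantization_types = ['FP32', 'INT8', 'INT4']
    memory_usages = [511, 187, 149]       # MB, as reported above
    inference_times = [1.76, 1.38, 1.08]  # seconds, as reported above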

    fig, ax1 = plt.subplots(figsize=(8, 6))
    bars = ax1.bar(quantization_types, memory_usages, color='blue', alpha=0.7)
    ax1.set_ylabel('Memory (MB)', color='blue')

    # Annotation logic
    for bar in bars:
        yval = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2, yval+30, 
                 f'{yval:.2f}', ha='center', va='bottom', 
                 color='blue', fontweight='bold')

    # Dual-axis formatting
    ax2 = ax1.twinx()
    ax2.plot(quantization_types, inference_times, color='red', 
             marker='o', linewidth=2)
    ax2.set_ylabel('Time (sec)', color='red')

    plt.title('Quantization Trade-offs')
    plt.show()

The Final Word

Through quantization, we’ve transformed GPT-2 from a resource-heavy behemoth into a nimble, efficient tool—proving that with the right techniques, even giants can learn to move lightly.

The code and measurements above make the case concretely. By changing only 10-15 lines of configuration to enable 4-bit quantization, we achieved, relative to FP32:

  • 71% reduction in memory footprint
  • 39% faster inference

If you're curious and want access to the full notebook for the experiment, head over to Google Colab.
