DEV Community

Shrijith Venkatramana

The Power of Quantization: Shrinking GPT2, Unleashing Speed

Imagine taking a powerful language model like GPT-2—capable of crafting stories, answering questions, and mimicking human text—and compressing it into a leaner, faster version without gutting its capabilities.

This is the promise of quantization: a technique that reduces the precision of a model’s calculations, trading marginal accuracy for dramatic efficiency gains.

Phase 0: The Technical Setup

    !pip install torch transformers accelerate bitsandbytes psutil

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    import torch
    import time
    import gc

    def get_memory_usage():
        return torch.cuda.memory_allocated() / 1e6 if torch.cuda.is_available() else 0


    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name = "gpt2"
    input_text = "Once upon a time"

Phase 1: The Baseline – Full Precision (FP32)

The experiment begins with GPT-2 in its natural state: 32-bit floating-point precision (FP32). This is the model’s “full power” mode—highly precise but resource-intensive.

  • Memory: Loading the FP32 model consumes 511 MB of GPU memory.
  • Speed: Generating up to 50 tokens (prompt included, since max_length caps the total length) from the prompt “Once upon a time” takes 1.76 seconds.
  • Post-Cleanup Footprint: Even after deleting the model, 458 MB of memory remains occupied.

FP32 works, but it’s bulky.
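
The 511 MB figure is close to what a back-of-envelope estimate predicts. The sketch below multiplies a parameter count by the bits each format needs per weight. The count of roughly 124 million parameters for the base `gpt2` checkpoint and the helper name `estimate_weight_mb` are assumptions introduced for illustration; a measured footprint also includes buffers and any tensors left at higher precision, so real numbers will differ.

```python
# Rough weight-memory estimate: parameters x bits per parameter.
# ASSUMPTION: ~124 million parameters for the base "gpt2" checkpoint.
APPROX_GPT2_PARAMS = 124_000_000

def estimate_weight_mb(num_params: int, bits_per_param: int) -> float:
    """Return the storage needed for the weights alone, in MB (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

for label, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{estimate_weight_mb(APPROX_GPT2_PARAMS, bits):.0f} MB")
```

The FP32 estimate lands near the measured value. The idealized INT8 and INT4 estimates come out lower than the measurements reported in this post, which would be consistent with some tensors staying at higher precision, although this sketch does not verify that.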

    # Load tokenizer and base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"Pre-load memory: {get_memory_usage()} MB")

    # Full precision model
    model_fp32 = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    print(f"Post-load memory: {get_memory_usage()} MB")  # 511.15 MB

    # Inference measurement
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    start_time = time.time()
    output = model_fp32.generate(**inputs, max_length=50)
    inference_time = time.time() - start_time  # 1.76s

    # Cleanup protocol
    del model_fp32, inputs
    gc.collect()
    torch.cuda.empty_cache()
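
A single timed run can be noisy, because the first call often pays one-off setup costs. A minimal sketch of a more robust measurement follows, assuming a hypothetical helper named `time_call` (not part of the original notebook) that runs an untimed warm-up and reports the median of several runs. A stand-in workload is used so the snippet runs without a GPU.

```python
import time
from statistics import median

def time_call(fn, warmup=1, repeats=5):
    """Run fn a few times and return the median wall-clock duration in seconds."""
    for _ in range(warmup):
        fn()  # untimed warm-up runs absorb one-off setup costs
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return median(durations)

# Stand-in workload so the sketch runs anywhere; in the experiment this
# could be something like: lambda: model.generate(**inputs, max_length=50)
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"median time: {elapsed:.4f}s")
```

When timing GPU work this way, calling `torch.cuda.synchronize()` before reading the clock helps ensure queued kernels have finished.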

Phase 2: Trimming the Fat – 8-bit Quantization (INT8)

Enter 8-bit quantization, where weights are stored as 8-bit integers instead of 32-bit floats. The transformation is immediate:

  • Memory: The INT8 model loads with just 187 MB, 63% smaller than FP32.
  • Speed: Inference accelerates to 1.38 seconds, a 22% improvement.
  • Post-Cleanup Footprint: Memory drops to 139 MB after deletion.

The model is lighter, faster, and still functional. A clear upgrade.

    # 8-bit configuration
    quant_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

    print(f"Pre-load memory: {get_memory_usage()} MB")  # 9.18 MB
    model_int8 = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=quant_config_8bit
    )

    # Dynamic input handling
    inputs_int8 = tokenizer(input_text, return_tensors="pt").to(model_int8.device)
    start_time = time.time()
    output = model_int8.generate(**inputs_int8, max_length=50)  # 1.38s

Phase 3: The Edge of Efficiency – 4-bit Quantization (INT4)

Now we push further. With 4-bit quantization, weights are compressed to near-minimal precision, and computations use 16-bit floats for stability.

  • Memory: The INT4 model weighs in at 149 MB, 71% lighter than FP32.
  • Speed: Inference time drops to 1.08 seconds, a 39% gain over FP32.
  • Post-Cleanup Footprint: Memory plummets to 58 MB—a fraction of the original.

This isn’t just optimization; it’s reinvention.
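
One reason 4-bit storage is so compact is that two 4-bit codes fit in a single byte. The sketch below illustrates that packing idea in plain Python. `pack_4bit` and `unpack_4bit` are hypothetical helpers written for this illustration, and the memory layout bitsandbytes actually uses may differ.

```python
def pack_4bit(codes):
    """Pack a list of 4-bit codes (0..15) into bytes, two codes per byte."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("each code must fit in 4 bits (0..15)")
    padded = list(codes) + [0] * (len(codes) % 2)  # pad to an even count
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))

def unpack_4bit(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))  # high nibble, then low nibble
    return codes[:count]

codes = [3, 15, 0, 7, 9]
packed = pack_4bit(codes)
print(len(codes), "codes ->", len(packed), "bytes")
assert unpack_4bit(packed, len(codes)) == codes  # round trip is lossless
```

The packing itself loses nothing; the precision loss in 4-bit quantization comes from mapping each float to one of only 16 codes in the first place.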

    # 4-bit configuration: 4-bit weights, 16-bit float compute as described above
    # (a reconstruction from that description; the notebook's exact settings may differ)
    quant_config_4bit = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    print(f"Pre-load memory: {get_memory_usage()} MB")
    model_int4 = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config_4bit
    )

    # Dynamic input handling
    inputs_int4 = tokenizer(input_text, return_tensors="pt").to(model_int4.device)
    start_time = time.time()
    output = model_int4.generate(**inputs_int4, max_length=50)  # 1.08s

The Trade-offs: Precision vs. Practicality

Quantization isn’t free. Reducing precision can subtly degrade model accuracy, but for many tasks—like casual text generation—the difference is imperceptible. What we gain far outweighs the cost:

  • Memory Efficiency: FP32: 511 MB → INT8: 187 MB → INT4: 149 MB.

Result: Models fit into tighter memory constraints, enabling deployment on consumer GPUs or edge devices.

  • Inference Speed: FP32: 1.76s → INT8: 1.38s → INT4: 1.08s.

Result: Faster responses for real-time applications, from chatbots to automated content generation.
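
The percentage gains quoted in this post follow directly from the measurements above. A small sketch of the arithmetic, with `percent_reduction` as a hypothetical helper name and the inputs taken from the reported numbers:

```python
def percent_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100

memory_mb = {"FP32": 511, "INT8": 187, "INT4": 149}   # MB, as reported above
time_s = {"FP32": 1.76, "INT8": 1.38, "INT4": 1.08}   # seconds, as reported above

for name in ("INT8", "INT4"):
    mem = percent_reduction(memory_mb["FP32"], memory_mb[name])
    spd = percent_reduction(time_s["FP32"], time_s[name])
    print(f"{name}: {mem:.0f}% less memory, {spd:.0f}% less time than FP32")
```

Note that the speed figures are reductions in wall-clock time for one run on one machine, so they are best read as indicative rather than as general speedup factors.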


How It Works: The Mechanics of Compression

At its core, quantization maps high-precision values (like 32-bit floats) to lower-precision formats (8- or 4-bit integers). For example:

  • FP32 uses 32 bits per number, capturing fine details but demanding heavy resources.
  • INT8/INT4 use fewer bits, approximating values with minimal loss.

The bitsandbytes library handles this automatically, repacking weights and adjusting computations to maintain stability.
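
To make the mapping concrete, here is a minimal sketch of symmetric "absmax" quantization, one common textbook scheme: scale the values so the largest magnitude lands on the largest integer, round, and multiply back to recover an approximation. The function names and sample weights are made up for illustration, and bitsandbytes uses more elaborate schemes, so treat this as the idea only.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax else 1.0   # float step per integer level
    return [round(v / scale) for v in values], scale

def dequantize(codes, scale):
    """Approximate the original floats from integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.52, 0.33, 1.0, -0.91]      # illustrative values only
for bits in (8, 4):
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes: {codes}, max error: {max_err:.4f}")
```

Running it shows the trade-off directly: the 4-bit codes are coarser, so the worst-case reconstruction error grows as the bit width shrinks.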


The Visual Proof

A side-by-side comparison seals the argument:

  • Memory Usage (Bar Chart): FP32 towers over INT8 and INT4, showcasing the stark reduction in resource demands.
  • Inference Time (Line Plot): The downward slope from FP32 to INT4 highlights the speed gains.

The takeaway? Quantization isn’t just a technical footnote—it’s a practical tool for democratizing AI.

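    # Visualization setup
    import matplotlib.pyplot as plt
    quantization_types = ['FP32', 'INT8', 'INT4']
    # Values reported in the phases above (the notebook may collect them programmatically)
    memory_usages = [511, 187, 149]        # MB after loading
    inference_times = [1.76, 1.38, 1.08]   # seconds per generation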

    fig, ax1 = plt.subplots(figsize=(8, 6))
    bars = ax1.bar(quantization_types, memory_usages, color='blue', alpha=0.7)
    ax1.set_ylabel('Memory (MB)', color='blue')

    # Annotation logic
    for bar in bars:
        yval = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2, yval+30, 
                 f'{yval:.2f}', ha='center', va='bottom', 
                 color='blue', fontweight='bold')

    # Dual-axis formatting
    ax2 = ax1.twinx()
    ax2.plot(quantization_types, inference_times, color='red', 
             marker='o', linewidth=2)
    ax2.set_ylabel('Time (sec)', color='red')

    plt.title('Quantization Trade-offs')
    plt.show()

The Final Word

Through quantization, we’ve transformed GPT-2 from a resource-heavy behemoth into a nimble, efficient tool—proving that with the right techniques, even giants can learn to move lightly.

This implementation shows quantization's effect through concrete code and measurements. By changing just 10-15 lines of configuration to enable quantization, we achieved (FP32 versus INT4):

  • 71% reduction in memory footprint
  • 39% shorter inference time

If you're curious and want access to the full notebook for the experiment, head over to Google Colab.
