Julien Simon

Originally published at julsimon.Medium

Video — Deep Dive: Quantizing Large Language Models


Quantization is an excellent technique for compressing Large Language Models (LLMs) and accelerating their inference.

In this 2-part video, we discuss model quantization. We first introduce what it is and build an intuition for rescaling and the problems it creates. We then cover the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we look at and compare quantization techniques: PyTorch, ZeroQuant, bitsandbytes, SmoothQuant, GPTQ, AWQ, HQQ, and the Hugging Face Optimum Intel library 😎
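To make the rescaling intuition concrete, here is a minimal, illustrative sketch of symmetric (absmax) int8 quantization in plain Python. The function names are hypothetical, not taken from any of the libraries mentioned above:

```python
# Minimal sketch of absmax (symmetric) int8 quantization,
# illustrating the rescaling step and the precision problem it creates.

def quantize_absmax(values, bits=8):
    """Rescale floats into signed integers of the given bit width."""
    qmax = 2 ** (bits - 1) - 1            # 127 for int8
    scale = max(abs(v) for v in values) / qmax
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate floats."""
    return [q * scale for q in quantized]

weights = [0.1, -0.4, 1.2, -3.0, 0.05]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
# The outlier -3.0 stretches the scale, so small weights like 0.05
# keep very little precision -- the core problem that techniques
# such as SmoothQuant and AWQ are designed to mitigate.
```

Static post-training quantization and quantization-aware training differ mainly in how the scale is chosen: from calibration data, or learned during training.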

Part 1:

Part 2:
