
Julien Simon

Originally published at julsimon.Medium

Video — Deep Dive: Quantizing Large Language Models

Quantization is an excellent technique to compress Large Language Models (LLMs) and accelerate their inference.

In this 2-part video, we discuss model quantization, first introducing what it is and building an intuition for rescaling and the problems it creates. Then we introduce the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we look at and compare quantization techniques: PyTorch, ZeroQuant, bitsandbytes, SmoothQuant, GPTQ, AWQ, HQQ, and the Hugging Face Optimum Intel library 😎
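
To make the rescaling intuition concrete, here is a minimal sketch (not from the video) of absmax int8 quantization with PyTorch. It round-trips a tensor through int8 and back, then shows how a single outlier value stretches the scale and degrades precision for every other value, one of the problems that techniques like SmoothQuant and AWQ are designed to mitigate.

```python
import torch

# Absmax (symmetric) int8 quantization: a single scale for the whole tensor.
x = torch.randn(4, 4) * 3.0                # stand-in for fp32 weights
scale = x.abs().max() / 127.0

x_int8 = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
x_dequant = x_int8.float() * scale         # back to float for computation
print("round-trip error:", (x - x_dequant).abs().max().item())

# A single outlier blows up the scale, so every other value loses precision.
x[0, 0] = 100.0
scale = x.abs().max() / 127.0
x_int8 = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
x_dequant = x_int8.float() * scale
print("round-trip error with an outlier:", (x - x_dequant).abs().max().item())
```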

Part 1:

Part 2:
