A complete walkthrough of speech transcription, LLM inference, tokenization, and 4-bit quantization. Built with Whisper, Llama 3.2, and the HuggingFace ecosystem.
Skill level: Intermediate | Runtime: Google Colab T4 GPU | Models: Whisper-medium, Llama-3.2-3B
Table of Contents
- The Two-Step Pipeline
- Tokenization
- Quantization
- The Chat Template
- Neural Networks and Transformers
- Tradeoffs
- What You Can Build Next
The Two-Step Pipeline
You feed an audio file into a Python script. Minutes later, you get formatted meeting minutes, a summary, and action items. No manual transcription, no human editor.
The notebook splits the problem into two clean stages. Stage one converts audio to text. Stage two converts text to structured meeting minutes.
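Both stages can run on a free Colab T4 GPU. Stage one uses OpenAI's open-source Whisper model (or, optionally, OpenAI's hosted transcription API, which sends the audio off the machine). Stage two uses Meta's Llama 3.2 3B, loaded with 4-bit quantization to keep its GPU memory footprint small.
What is ASR? Automatic Speech Recognition converts raw audio waveforms into text. Whisper treats transcription as a sequence-to-sequence problem: an encoder processes the audio as log-Mel spectrogram frames, and a decoder predicts the next text token given the encoded audio and the tokens produced so far.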
Two Transcription Options
The notebook gives you two paths. The open-source path runs Whisper locally on the GPU. The API path sends the audio to OpenAI's servers.
import torch
from transformers import pipeline
# Open-source: runs locally on T4 GPU
pipe = pipeline(
"automatic-speech-recognition",
model="openai/whisper-medium.en",
dtype=torch.float16,
device='cuda',
return_timestamps=True
)
result = pipe(audio_filename)
transcription = result["text"]
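# API option: offloads to OpenAI servers
from openai import OpenAI
openai = OpenAI()  # reads OPENAI_API_KEY from the environment
AUDIO_MODEL = "gpt-4o-mini-transcribe"
with open(audio_filename, "rb") as audio_file:
    transcription = openai.audio.transcriptions.create(
        model=AUDIO_MODEL, file=audio_file, response_format="text"
    )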
| | Open-source path | API path |
|---|---|---|
| Cost | Free | Per minute of audio |
| Data privacy | Stays local | Leaves your environment |
| Setup | Needs GPU | Works anywhere |
| Offline use | Yes | No |
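Because the local pipeline is created with return_timestamps=True, its result carries a "chunks" list next to "text", where each chunk holds a (start, end) timestamp tuple in seconds. The sketch below shows one way to turn that into timestamped lines. The sample_result dictionary is hand-written for illustration, and format_timestamp and format_chunks are hypothetical helpers, not part of the notebook. It assumes an end timestamp may be None for a chunk that is cut off.

```python
def format_timestamp(seconds):
    """Render seconds as MM:SS; None (an open-ended chunk) becomes '??:??'."""
    if seconds is None:
        return "??:??"
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_chunks(result):
    """Turn a pipeline-style result dict into one timestamped line per chunk."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        stamp = f"[{format_timestamp(start)}-{format_timestamp(end)}]"
        lines.append(f"{stamp} {chunk['text'].strip()}")
    return "\n".join(lines)


# Hand-written sample in the shape the pipeline returns (illustrative only).
sample_result = {
    "text": " Good evening. The meeting will come to order.",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " Good evening."},
        {"timestamp": (2.5, 65.0), "text": " The meeting will come to order."},
    ],
}

timestamped = format_chunks(sample_result)
print(timestamped)
```

Timestamped lines like these can be pasted into the prompt in place of the plain transcript if you want the minutes to reference when something was said.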
Tokenization: How Text Becomes Numbers
Language models do not read text. They read numbers. Tokenization is the bridge between the two.
A tokenizer splits your text into subword chunks called tokens, then maps each chunk to an integer ID. The word "quantization" might split into tokens like ["quant", "ization"], producing IDs like [42891, 2065]. The model works with these integer sequences from start to finish.
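To make the split-then-map idea concrete, here is a toy tokenizer. It is a sketch under loud assumptions: the vocabulary and IDs are invented, and it uses greedy longest-match splitting, whereas Llama's real tokenizer uses byte-pair encoding with learned merges. The two-step shape (text to pieces, pieces to integers) is the part that carries over.

```python
# Toy vocabulary: the pieces and IDs are made up for illustration.
TOY_VOCAB = {"quant": 0, "ization": 1, "token": 2, "s": 3, " ": 4, "<unk>": 5}


def toy_tokenize(text, vocab):
    """Greedy longest-match split of text into pieces found in vocab."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a piece matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")  # no known piece starts at this character
            i += 1
    return pieces


def toy_encode(text, vocab):
    """Map each piece to its integer ID."""
    return [vocab[piece] for piece in toy_tokenize(text, vocab)]


pieces = toy_tokenize("quantization tokens", TOY_VOCAB)
ids = toy_encode("quantization tokens", TOY_VOCAB)
print(pieces)
print(ids)
```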
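from transformers import AutoTokenizer
LLAMA = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
# apply_chat_template formats your messages list
# into the exact token sequence Llama expects
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,  # end with an open assistant turn
return_tensors="pt"
).to("cuda")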
Why pad_token matters: When you process multiple sequences in a batch, they need equal length. Shorter sequences get padded with a special token. Setting pad_token = eos_token tells the tokenizer which ID to use. Without it, the tokenizer raises an error when it is asked to pad a batch.
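The padding step can be sketched in plain Python. This is an illustration, not the tokenizer's internals: pad_batch is a hypothetical helper and the token IDs are made up. It also builds the attention mask that tells the model which positions are real tokens and which are padding.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad lists of token IDs to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)
        if side == "right":
            input_ids.append(seq + padding)
            attention_mask.append([1] * len(seq) + mask_pad)
        else:
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + [1] * len(seq))
    return input_ids, attention_mask


EOS_ID = 9  # made-up ID standing in for the end-of-sequence token
batch = [[11, 12, 13, 14], [21, 22]]
input_ids, attention_mask = pad_batch(batch, pad_id=EOS_ID)
print(input_ids)
print(attention_mask)
```

Reusing the end-of-sequence ID as padding is safe here because the mask marks padded positions with 0, so the model does not attend to them.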
The tokenizer also inserts special tokens like <|begin_of_text|> and role markers for system and user. These are instructions to the model, not content. apply_chat_template handles all of this based on the model's expected format.
Quantization: Shrinking the Model to Fit
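A 3B parameter model in full 32-bit precision needs around 12 GB of GPU memory for its weights alone (3 billion weights at 4 bytes each). A free T4 has about 15 GB usable, which leaves little room for activations and everything else. 4-bit quantization stores each quantized weight in 4 bits instead of 32, cutting the memory for those weights by roughly 8x (or roughly 4x relative to a 16-bit checkpoint).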
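The arithmetic behind those figures is short enough to check directly. This sketch counts weight storage only and assumes a round 3 billion parameters. It ignores activations, the KV cache, quantization constants, and any layers left unquantized, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 3e9  # round figure for a "3B" model; the exact count is a bit higher

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```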
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True, # quantize the quantization constants too
bnb_4bit_compute_dtype=torch.bfloat16, # compute in bfloat16 for speed
bnb_4bit_quant_type="nf4" # NormalFloat4: better for normal distributions
)
model = AutoModelForCausalLM.from_pretrained(
LLAMA,
device_map="auto",
quantization_config=quant_config
)
What NF4 means: NormalFloat4 is a 4-bit data type designed for neural network weights, which typically follow a roughly normal distribution. It places more quantization levels near zero (where most weights cluster) and fewer at the extremes. In the evaluations reported by the QLoRA authors, who introduced NF4, it generally preserved more accuracy than plain 4-bit integer or 4-bit float quantization.
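The "more levels near zero" idea can be illustrated with a simplified scheme: place 16 levels at evenly spaced quantiles of a standard normal distribution, then quantize each block by scaling with its largest absolute value and picking the nearest level. This is a sketch of the principle only. The real NF4 table used by bitsandbytes differs in detail (for example, it includes an exact zero), and all function names here are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(num_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    normal = NormalDist()
    # Offset by half a step so no quantile lands on 0 or 1 (infinite values).
    raw = [normal.inv_cdf((i + 0.5) / num_levels) for i in range(num_levels)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


def quantize_block(weights, levels):
    """Absmax-scale one block, then store the index of the nearest level."""
    scale = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda k: abs(levels[k] - w / scale))
        for w in weights
    ]
    return codes, scale


def dequantize_block(codes, scale, levels):
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [levels[c] * scale for c in codes]


levels = normal_quantile_levels()
center_gap = levels[8] - levels[7]   # spacing between the two middle levels
edge_gap = levels[15] - levels[14]   # spacing between the two outermost levels

block = [0.02, -0.01, 0.30, -0.05, 0.00, 0.11, -0.27, 0.04]
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
print(f"center gap {center_gap:.3f}, edge gap {edge_gap:.3f}")
print(codes)
```

The center gap comes out smaller than the edge gap, which is the property the paragraph above describes: small weights, the common case, are represented with finer resolution.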
Double quantization (bnb_4bit_use_double_quant=True) quantizes the quantization constants themselves too. It saves about 0.4 bits per parameter on top of base 4-bit compression: a small gain at negligible cost.
device_map="auto" lets HuggingFace decide where each layer goes: GPU first, with overflow to CPU if GPU memory runs out. For a 3B model on a T4, everything fits on GPU.
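The "about 0.4 bits" figure can be reproduced from the block sizes reported in the QLoRA paper: one constant per 64 weights, and one second-level constant per 256 first-level constants. Treat those block sizes as assumptions taken from the paper; the defaults in your installed bitsandbytes version may differ.

```python
BLOCK_SIZE = 64           # weights per quantization block (QLoRA paper's setting)
SECOND_LEVEL_BLOCK = 256  # first-level constants per second-level block

# Without double quantization: one 32-bit constant per block of weights.
overhead_plain = 32 / BLOCK_SIZE

# With double quantization: 8-bit constants, plus one 32-bit constant
# shared by each group of 256 first-level constants.
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)

saving = overhead_plain - overhead_double
print(f"plain:  {overhead_plain:.3f} bits per parameter")
print(f"double: {overhead_double:.3f} bits per parameter")
print(f"saving: {saving:.3f} bits per parameter")
```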
The Chat Template: Speaking the Model's Language
Instruction-tuned models like Llama 3.2 Instruct were fine-tuned on conversations in a specific format. Send text in the wrong format and the model either ignores your instructions or produces garbage. The chat template enforces the right format every time.
system_message = """
You produce minutes of meetings from transcripts, with summary,
key discussion points, takeaways and action items with owners,
in markdown format without code blocks.
"""
user_prompt = f"""
Below is an extract transcript of a Denver council meeting.
Please write minutes in markdown without code blocks, including:
- a summary with attendees, location and date
- discussion points
- takeaways
- action items with owners
Transcription:
{transcription}
"""
messages = [
{"role": "system", "content": system_message},
{"role": "user", "content": user_prompt}
]
# apply_chat_template wraps messages in Llama's expected token format
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
System vs user roles: The system message sets the model's persona and output constraints. The user message contains the actual task. Keeping them separate gives you fine-grained control: swap the transcript without touching output format instructions, or change the output format without touching the transcript.
After apply_chat_template, your clean Python dictionaries become a single integer tensor. That tensor goes directly into generate(). No string manipulation after this point: everything is numbers on the GPU until you decode the output.
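It helps to see roughly what the template produces before it is turned into integers. The function below is a simplified approximation of the Llama 3 family's chat layout, written by hand for illustration. The real template shipped with the model adds further details, so treat render_llama3_style as a hypothetical stand-in rather than the actual template.

```python
def render_llama3_style(messages, add_generation_prompt=True):
    """Approximate the Llama 3 family's chat layout as a plain string."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += message["content"].strip() + "<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model starts writing its reply.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


demo_messages = [
    {"role": "system", "content": "You write meeting minutes."},
    {"role": "user", "content": "Transcription: ..."},
]
rendered = render_llama3_style(demo_messages)
print(rendered)
```

To inspect the real thing, call tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True), which returns the formatted string instead of token IDs.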
Neural Networks and Transformer Quantization
A transformer model is a stack of layers. Each layer contains weight matrices stored as 2D arrays of floating point numbers. During a forward pass, the model multiplies your input token embeddings by these matrices over and over, applying attention at each step.
The weight matrices of the linear layers (the attention projections and the feed-forward blocks) are what quantization compresses; embeddings and normalization weights stay in higher precision. At inference time, BitsAndBytes dequantizes each weight block just before the matrix multiplication, performs the multiplication in bfloat16, then moves on. The full 4-bit weights stay compressed in GPU memory at all times.
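The "stored compressed, expanded just in time" pattern can be sketched in plain Python: two 4-bit codes share one byte, and they are unpacked and rescaled only at the moment of the multiplication. This is an illustration of the idea, not how the bitsandbytes GPU kernels are written. The evenly spaced LEVELS table and all function names are assumptions for the demo.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (0..15) two per byte; pads odd-length input with 0."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]


def dot_with_packed(packed, count, scale, levels, activations):
    """Dequantize just in time, then take the dot product with activations."""
    weights = [levels[c] * scale for c in unpack_nibbles(packed, count)]
    return sum(w * a for w, a in zip(weights, activations))


# Evenly spaced toy levels in [-1, 1]; real NF4 levels are not evenly spaced.
LEVELS = [-1 + 2 * k / 15 for k in range(16)]

codes = [0, 15, 7, 8, 3]
packed = pack_nibbles(codes)
result = dot_with_packed(
    packed, len(codes), scale=0.5, levels=LEVELS,
    activations=[1.0, 2.0, 0.0, 0.0, 0.0],
)
print(len(packed), "bytes hold", len(codes), "weights")
print(result)
```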
from transformers import TextStreamer
streamer = TextStreamer(tokenizer)
outputs = model.generate(
inputs,
max_new_tokens=2000,
streamer=streamer # streams tokens to stdout as generated
)
response = tokenizer.decode(outputs[0])
The TextStreamer prints tokens to the console as the model generates them. The model produces one token per forward pass. You see output build token by token (a word may span several tokens) because each new token requires a separate forward pass through all layers.
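The one-token-per-pass loop looks roughly like this. The sketch replaces the model with a scripted stand-in (toy_next_token) so it runs anywhere; the token IDs and the greedy_generate helper are hypothetical, and real generation adds sampling strategies and caching on top of the same loop.

```python
def toy_next_token(token_ids):
    """Stand-in for a forward pass: returns the next ID from a fixed script."""
    script = {1: 5, 5: 6, 6: 7, 7: 2}  # made-up transitions; 2 plays EOS
    return script[token_ids[-1]]


def greedy_generate(prompt_ids, next_token_fn, eos_id, max_new_tokens, on_token=None):
    """Call the model once per new token until EOS or the token budget."""
    token_ids = list(prompt_ids)
    forward_passes = 0
    for _ in range(max_new_tokens):
        next_id = next_token_fn(token_ids)  # one forward pass
        forward_passes += 1
        token_ids.append(next_id)
        if on_token is not None:
            on_token(next_id)  # a streamer would decode and print here
        if next_id == eos_id:
            break
    return token_ids, forward_passes


streamed = []
output_ids, passes = greedy_generate(
    prompt_ids=[1], next_token_fn=toy_next_token, eos_id=2,
    max_new_tokens=10, on_token=streamed.append,
)
print(output_ids, passes)
```

The on_token callback plays the role of the streamer: it is called once per generated token, which is why output appears incrementally.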
Tradeoffs to Know Before You Ship
Whisper medium vs large
The notebook uses whisper-medium.en. The .en suffix means English-only. It is the same size as the multilingual whisper-medium, but it is specialized for English. If your meetings include non-English speakers, swap to the multilingual whisper-large-v3 and expect roughly twice the GPU memory usage, since it has about twice as many parameters (roughly 1.5B versus 0.77B).
3B vs larger Llama models
Llama 3.2 3B handles summarization well. For long meetings with complex technical jargon, a 70B model is likely to produce more accurate action items. You cannot run 70B on a free T4, even with 4-bit quantization: at 4 bits per weight, 70 billion parameters still need about 35 GB for the weights alone. You need either a larger GPU (for example on a paid Colab tier) or API inference.
Float16 vs bfloat16
Whisper runs in torch.float16. Llama's quantized compute runs in bfloat16. Both are 16-bit formats. Float16 spends more bits on the mantissa (10 versus 7), so it resolves finer differences between nearby values. Bfloat16 spends more bits on the exponent (8 versus 5), which gives it roughly the same dynamic range as float32 and makes it far less prone to overflow.
The Colab CUDA error that trips everyone up: If you see "CUDA is required but not available for bitsandbytes", your runtime was recycled by Google. Fix: open the Runtime menu and choose Disconnect and delete runtime. Reconnect to a fresh T4. Rerun from the top. Do not touch package versions.
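The precision-versus-range difference can be demonstrated with the standard library alone. Python's struct module supports IEEE half precision directly (format "e"). Bfloat16 is simulated here by keeping only the top 16 bits of a float32, which truncates instead of rounding; that shortcut is an assumption made for brevity, not how hardware converts.

```python
import struct


def to_float16(value):
    """Round-trip through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_bfloat16(value):
    """Simulate bfloat16 by keeping only the top 16 bits of a float32.

    This truncates rather than rounds, which is enough for a demonstration.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


small_step = 1.0 + 2 ** -10      # smallest float16 step above 1.0
print(to_float16(small_step))    # float16 keeps the step
print(to_bfloat16(small_step))   # bfloat16 loses it

print(to_bfloat16(70000.0))      # within bfloat16's range, but coarse
try:
    to_float16(70000.0)
    fp16_overflowed = False
except OverflowError:
    fp16_overflowed = True       # 70000 is above float16's maximum of 65504
print(fp16_overflowed)
```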
What You Can Build Next
The notebook is a foundation. A few extensions that follow naturally:
Speaker diarization. Add pyannote-audio before the Whisper step to tag each segment with a speaker ID. Feed those labels into the prompt so Llama assigns action items to the right person.
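One way to combine the two outputs is to tag each transcript chunk with the speaker whose turn overlaps it most in time. The sketch below works on hand-written sample data. The field names for speaker turns are illustrative assumptions, not the output format of any diarization library, so you would need to adapt them to whatever your diarization step returns.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_chunks(chunks, speaker_turns):
    """Tag each transcript chunk with the speaker whose turn overlaps it most."""
    labeled = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        best = max(
            speaker_turns,
            key=lambda turn: overlap(start, end, turn["start"], turn["end"]),
        )
        labeled.append(f"{best['speaker']}: {chunk['text'].strip()}")
    return "\n".join(labeled)


# Hand-written sample data; the field names are illustrative assumptions.
chunks = [
    {"timestamp": (0.0, 4.0), "text": " I move to approve the budget."},
    {"timestamp": (4.0, 7.0), "text": " Seconded."},
]
speaker_turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.2},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 7.5},
]

labeled_transcript = label_chunks(chunks, speaker_turns)
print(labeled_transcript)
```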
Gradio streaming interface. Student Emad S. adapted the notebook to stream tokens into a Gradio UI using TextIteratorStreamer and Python background threads. The result is a browser-based app where you upload audio and watch minutes appear in real time.
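The background-thread pattern behind that kind of streaming can be sketched without a model: a worker thread pushes text pieces into a queue while a generator on the main thread yields the accumulated text after each piece. Here fake_generate stands in for model.generate and the queue stands in for the role TextIteratorStreamer plays; both are simplifications for illustration, and this is not Emad S.'s code.

```python
import queue
import threading


def fake_generate(token_queue):
    """Stand-in for model.generate: pushes text pieces, then a stop marker."""
    for piece in ["# Minutes", "\n", "- Budget", " approved"]:
        token_queue.put(piece)
    token_queue.put(None)  # signals that generation is finished


def stream_minutes():
    """Yield the accumulated text after every new piece arrives."""
    token_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(token_queue,))
    worker.start()  # generation runs in the background
    text = ""
    while True:
        piece = token_queue.get()  # blocks until the worker produces something
        if piece is None:
            break
        text += piece
        yield text
    worker.join()


snapshots = list(stream_minutes())
print(snapshots[-1])
```

A generator shaped like stream_minutes is the kind of function a UI framework can consume to refresh the display as each partial result arrives.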
Persistent storage. Write the output to a Google Doc via the Drive API. Every meeting auto-archives with a timestamp and a searchable transcript.
References
| Resource | Link |
|---|---|
| MeetingBank dataset | huuuyeah/meetingbank on HuggingFace |
| Audio dataset | MeetingBank Audio |
| Denver extract MP3 | Google Drive |
| Whisper model | openai/whisper-medium.en |
| Llama model | meta-llama/Llama-3.2-3B-Instruct |
| Gradio variation | Emad S. Colab |
Source dataset: MeetingBank. Models: openai/whisper-medium.en and meta-llama/Llama-3.2-3B-Instruct. Quantization: bitsandbytes. Runtime: Google Colab T4 GPU. Framework: HuggingFace Transformers 4.57.6.