Deep Dive: OpenAI's GPT-OSS
- Host: Machine Learning Tokyo
- Speaker: Alex Shen (AI specialist at Microsoft Tokyo)
- Link: https://www.meetup.com/machine-learning-tokyo/events/310580701/
(These are my memos taken during the talk. Posted with permission.)
Overview
- Deep dive into GPT OSS, an open-weight model family from OpenAI with 120B and 20B variants.
- Performance efficiency:
- 120B model can run on a single 80GB VRAM GPU
- 20B model can run locally on a laptop
- Key architectural features
- group query attention
- sliding window attention
- learned biases
- Pre-trained on text-only data
- Post-trained using reinforcement learning
Outline
- Introduction and Overview of GPT OSS
- Architecture and Normalization of GPT OSS
- Algorithm Gems of the Attention Block
- Context Length Extension and Positional Encoding
- Training and Post-Training Details
Introduction and Overview of GPT OSS
- GPT OSS: the first open-weight reasoning model from OpenAI
- Two size variants: 120B and 20B
- Technical highlights of the models:
- Mixture of Experts (MoE) architecture
- ability to run on a single 80GB VRAM GPU
Architecture and Normalization of GPT OSS
- Architecture: standard SOTA LLM architecture built from transformer blocks with attention and MoE sub-blocks.
- MoE (Mixture of Experts) part of the transformer layer: gated SwiGLU activation and quantization to 4.25 bits per parameter
- Importance of normalization in deep neural networks, compared with Layer Norm and RMS Norm
- Layer Norm: used in the original Transformer architecture
- RMS Norm:
- computationally cheaper (removing the mean subtraction step)
- provides similar quality to Layer Norm
- Placement of normalization in the transformer block: contrasting post-norm and pre-norm
- Post-norm
- applies normalization after the residual connection
- stabilizes each layer’s output
- can make training deep models more difficult, requiring careful learning rate warmup and risking vanishing gradients
- Pre-norm
- applies normalization before the attention and MOE blocks
- more stable for training very deep models
- less sensitive to learning rate schedules or warmup
- preferred choice in modern large language models; GPT-OSS uses this
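The Layer Norm vs RMS Norm difference above can be sketched in a few lines. A minimal NumPy sketch (without the learnable scale/shift parameters real implementations add):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm: subtract the mean, then divide by the standard deviation
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: skip the mean subtraction, divide by the root-mean-square
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # zero-mean, unit-variance output
print(rms_norm(x))    # unit-RMS output; the mean is not removed
```

Dropping the mean subtraction saves one reduction pass per normalization, which is where RMS Norm's speed advantage comes from.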
Algorithm Gems of the Attention Block
- Group Query Attention (GQA): splits the query heads into groups and shares one set of key-value (KV) heads per group
- Compared to Multi-Head Attention (MHA) from the original Transformer: reduces the number of KV parameters and the memory required for the KV cache
- allows each group to share a set of KV heads -> significantly lower memory and compute requirements during inference
- Sliding window attention: a local, sparse self-attention pattern that reduces computational complexity
- allows large language models to efficiently process very long sequences of text
- restricts each token to attend only to a fixed window of neighboring tokens
- computation scales linearly with the length of the input
- reduces memory and compute requirements
- GPT OSS can handle extremely long contexts (up to 128k tokens) that would be impractical with standard attention mechanisms
- Alternating between dense and sliding window layers in GPT OSS allows the model to focus on both local and global information
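The KV-cache saving from GQA is simple arithmetic. The layer/head counts below are illustrative assumptions, not necessarily GPT-OSS's exact configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each of shape [seq_len, n_kv_heads, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config: 64 query heads; GQA groups them over 8 shared KV heads
mha = kv_cache_bytes(n_layers=36, n_kv_heads=64, head_dim=64, seq_len=4096)
gqa = kv_cache_bytes(n_layers=36, n_kv_heads=8, head_dim=64, seq_len=4096)
print(f"MHA cache: {mha / 2**20:.0f} MiB, GQA cache: {gqa / 2**20:.0f} MiB")  # 8x smaller
```

The cache shrinks by exactly the query-heads-per-group factor, which is why GQA matters so much at long context lengths.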
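The sliding-window restriction described above can be sketched as a banded causal mask (toy sizes; real models use far larger windows):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # Causal mask where token i may attend only to tokens in (i - window, i]
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
# Each row has at most `window` attended positions, so attention compute
# grows linearly with sequence length instead of quadratically.
```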
Context Length Extension and Positional Encoding
- A learned bias added to the attention block enables the model to pay no attention to any token
- allocates some of the attention probability mass to a special "dummy" position, rather than forcing it to be distributed among the actual tokens
- letting an attention head pay no attention helps the model avoid focusing on irrelevant or unhelpful information
- Importance of positional encoding: comparing sinusoidal position encoding and Rotary Position Embedding (RoPE) position encoding
- Positional encoding helps model distinguish "dog bites man" from "man bites dog"
- Sinusoidal position encoding:
- adds a unique vector of sines and cosines to each token’s embedding, providing absolute position information
- allows the model to learn about word order and distance
- limited in how well it can generalize to longer sequences than it was trained on
- RoPE (Rotary Position Embedding):
- improves by encoding position information through a rotation matrix applied to the query and key vectors
- captures relative position information more directly and preserves the norm of the vectors, helps with training stability
- enables better extrapolation to longer sequences
- de facto standard in modern large language models.
- YaRN (Yet another RoPE extensioN) improves RoPE's performance at long contexts by scaling the frequency dimensions used in the positional encoding
- standard RoPE: high-frequency components can "wrap around" too quickly when the context length is much longer than what the model was trained on -> position information becomes ambiguous or unstable
- YARN adjusts (scales) frequency dimensions so positional encoding remains meaningful and stable even for very long sequences
- Scaling prevents rapid repetition of high-frequency components -> allowing model to better handle and reason over long contexts without losing track of position information
- frequency dimension: component of the encoding vector that oscillates at specific frequency
- In sinusoidal and rotary positional encodings (like RoPE), each position in the sequence is mapped to vector where each element (or pair of elements) varies sinusoidally with a different frequency
- Lower frequency dimensions change slowly across positions, capture broad positional trends
- Higher frequency dimensions change rapidly, capture fine-grained positional differences
- Combining multiple frequency dimensions so model can uniquely represent each position in sequence, encode both absolute and relative position information
- Benefits of YARN:
- better capture of fine-grained local attention
- coarse-grained global focus
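The learned-bias / "dummy position" idea above amounts to appending one extra logit to the attention softmax and discarding its probability mass. A minimal sketch (function and variable names are mine):

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    # Append a learned per-head "sink" logit; the probability mass it
    # absorbs is simply dropped, so the real tokens can sum to less than 1.
    z = np.concatenate([scores, [sink_logit]])
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p[:-1]                        # drop the dummy/sink position

scores = np.array([0.1, 0.2, 0.0])
p = softmax_with_sink(scores, sink_logit=5.0)
print(p, p.sum())  # token probabilities now sum to well under 1.0
```

When the sink logit is large relative to the token scores, the head effectively attends to nothing, which is exactly the "pay no attention" behavior described above.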
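RoPE's two properties mentioned above, norm preservation and relative-position encoding, can be checked numerically. A NumPy sketch using the standard RoPE conventions, with a comment marking where a YaRN-style frequency rescaling would apply:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive dimension pairs by angles pos * base**(-2i/d).
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    # YaRN-style extension (sketch): rescale inv_freq per frequency band so
    # high-frequency dims don't wrap around beyond the trained context length.
    theta = pos * inv_freq
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[..., 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

q = np.random.default_rng(0).normal(size=8)
k = np.random.default_rng(1).normal(size=8)
# Rotation preserves the vector norm (helps training stability)
assert np.isclose(np.linalg.norm(rope(q, pos=7)), np.linalg.norm(q))
# The dot product depends only on the relative offset: positions (10, 13)
# give the same score as positions (0, 3)
a = rope(q, 10) @ rope(k, 13)
b = rope(q, 0) @ rope(k, 3)
assert np.isclose(a, b)
```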
Training and Post-Training Details
- Pre-training process: done on text-only data with the CBRN filter applied
- CBRN stands for "Chemical, Biological, Radiological, and Nuclear"
- Training time:
- 2 million GPU hours for 120B model
- roughly 10x fewer GPU hours for the 20B model
- Post-training process: reinforcement learning with the "chain of thought" format
- New Harmony chat format:
- newly introduced response format for gpt-oss models
- structures conversations with clear roles, channels, and special tokens; designed for better interoperability, reasoning separation, and tool integration
- variable reasoning effort levels: model’s ability to adjust amount of computational/cognitive effort it expends when solving different tasks or answering different questions
- allocate more resources (such as more layers, more steps, or more complex chains of thought) to harder problems
- implemented through mechanisms like:
- dynamic depth (of layers)
- adaptive computation
- explicit reasoning steps (e.g., chain-of-thought prompting)
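The Harmony roles, channels, and reasoning-effort levels above can be sketched as a raw token stream. The exact token strings and message contents here are illustrative assumptions based on the gpt-oss release, not a normative spec:

```python
# Sketch of a Harmony-format conversation: role after <|start|>, an optional
# channel, the body after <|message|>, terminated by <|end|>.
conversation = (
    "<|start|>system<|message|>You are a helpful assistant.\n"
    "Reasoning: high<|end|>"                    # variable reasoning effort level
    "<|start|>user<|message|>What is 2+2?<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>"
    "Trivial arithmetic; answer directly.<|end|>"  # chain of thought, kept separate
    "<|start|>assistant<|channel|>final<|message|>4<|end|>"
)
print(conversation)
```

Separating the `analysis` and `final` channels is what lets the format isolate the chain of thought from the user-visible answer.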