Deep Dive: OpenAI's GPT-OSS
- Host: Machine Learning Tokyo
- Speaker: Alex Shen (AI specialist at Microsoft Tokyo)
- Link: https://www.meetup.com/machine-learning-tokyo/events/310580701/
(These are my memos taken during the talk. Posted with permission.)
Overview
- Deep dive into GPT OSS, an open-weight model family from OpenAI with 120B and 20B variants.
- Performance efficiency:
- 120B model can run on a single 80GB VRAM GPU
- 20B model can run locally on a laptop
- Key architectural features
- group query attention
- sliding window attention
- learned biases
- Pre-trained on text-only data
- Post-trained using reinforcement learning
Outline
- Introduction and Overview of GPT OSS
- Architecture and Normalization of GPT OSS
- Algorithm Gems of the Attention Block
- Context Length Extension and Positional Encoding
- Training and Post-Training Details
Introduction and Overview of GPT OSS
- GPT OSS: the first open-weight reasoning model from OpenAI
- Two size variants: 120B and 20B
- Technical highlights of the models:
- Mixture of Experts (MoE) architecture
- ability to run on a single 80GB VRAM GPU
Architecture and Normalization of GPT OSS
- Architecture: standard SOTA LLM architecture built from transformer blocks with attention and MoE sub-blocks.
- MoE (Mixture of Experts) part of the transformer layer: gated SwiGLU activation and quantization to 4.25 bits per parameter
- Importance of normalization in deep neural networks, compared with Layer Norm and RMS Norm
- Layer Norm: used in the original Transformer architecture
- RMS Norm:
- computationally cheaper (removing the mean subtraction step)
- provides similar quality to Layer Norm
- Placement of normalization in the transformer block: contrasting post-norm and pre-norm
- Post-norm
- applies normalization after the residual connection
- stabilizes each layer’s output
- can make training deep models more difficult, requiring careful learning rate warmup and risking vanishing gradients
- Pre-norm
- applies normalization before the attention and MOE blocks
- more stable for training very deep models
- less sensitive to learning rate schedules or warmup
- preferred choice in modern large language models; GPT-OSS uses this
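The Layer Norm vs RMS Norm difference above can be sketched in a few lines. A minimal NumPy sketch (without the learnable scale/shift parameters real implementations add):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm: subtract the mean, then divide by the standard deviation
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: skip the mean subtraction, divide by the root-mean-square
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # zero-mean, unit-variance output
print(rms_norm(x))    # unit-RMS output; the mean is not removed
```

Dropping the mean subtraction saves one reduction pass per normalization, which is where RMS Norm's speed advantage comes from.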
Algorithm Gems of the Attention Block
- Group Query Attention (GQA): splits the query heads into groups and shares one set of key-value (KV) heads per group
- Compared to Multi-Head Attention (MHA) from the original Transformer: reduces the number of KV parameters and the memory required for the KV cache
- allows each group to share a set of KV heads -> significantly lower memory and compute requirements during inference
- Sliding window attention: a local, sparse self-attention pattern that reduces computational complexity
- allows large language models to efficiently process very long sequences of text
- restricts each token to attend only to a fixed window of neighboring tokens
- computation scales linearly with the length of the input
- reduces memory and compute requirements
- GPT OSS can handle extremely long contexts (up to 128k tokens) that would be impractical with standard attention mechanisms
- Alternating between dense and sliding window layers in GPT OSS allows the model to focus on both local and global information
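The KV-cache saving from GQA is simple arithmetic. The layer/head counts below are illustrative assumptions, not necessarily GPT-OSS's exact configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each of shape [seq_len, n_kv_heads, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config: 64 query heads; GQA groups them over 8 shared KV heads
mha = kv_cache_bytes(n_layers=36, n_kv_heads=64, head_dim=64, seq_len=4096)
gqa = kv_cache_bytes(n_layers=36, n_kv_heads=8, head_dim=64, seq_len=4096)
print(f"MHA cache: {mha / 2**20:.0f} MiB, GQA cache: {gqa / 2**20:.0f} MiB")  # 8x smaller
```

The cache shrinks by exactly the query-heads-per-group factor, which is why GQA matters so much at long context lengths.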
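The sliding-window restriction described above can be sketched as a banded causal mask (toy sizes; real models use far larger windows):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # Causal mask where token i may attend only to tokens in (i - window, i]
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
# Each row has at most `window` attended positions, so attention compute
# grows linearly with sequence length instead of quadratically.
```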
Context Length Extension and Positional Encoding
- A learned bias added to the attention block enables the model to pay no attention to any token
- allocates some of the attention probability mass to a special "dummy" position, rather than forcing it to be distributed among the actual tokens
- letting an attention head pay no attention helps the model avoid focusing on irrelevant or unhelpful information
- Importance of positional encoding: comparing sinusoidal position encoding and Rotary Position Embedding (RoPE) position encoding
- Positional encoding helps model distinguish "dog bites man" from "man bites dog"
- Sinusoidal position encoding:
- adds a unique vector of sines and cosines to each token’s embedding, providing absolute position information
- allows the model to learn about word order and distance
- limited in how well it can generalize to longer sequences than it was trained on
- RoPE (Rotary Position Embedding):
- improves by encoding position information through a rotation matrix applied to the query and key vectors
- captures relative position information more directly and preserves the norm of the vectors, helps with training stability
- enables better extrapolation to longer sequences
- de facto standard in modern large language models.
- YaRN (Yet another RoPE extensioN) improves RoPE's performance at long contexts by scaling the frequency dimensions used in the positional encoding
- standard RoPE: high-frequency components can "wrap around" too quickly when the context length is much longer than what the model was trained on -> position information becomes ambiguous or unstable
- YARN adjusts (scales) frequency dimensions so positional encoding remains meaningful and stable even for very long sequences
- Scaling prevents rapid repetition of high-frequency components -> allowing model to better handle and reason over long contexts without losing track of position information
- frequency dimension: component of the encoding vector that oscillates at specific frequency
- In sinusoidal and rotary positional encodings (like RoPE), each position in the sequence is mapped to vector where each element (or pair of elements) varies sinusoidally with a different frequency
- Lower frequency dimensions change slowly across positions, capture broad positional trends
- Higher frequency dimensions change rapidly, capture fine-grained positional differences
- Combining multiple frequency dimensions so model can uniquely represent each position in sequence, encode both absolute and relative position information
- Benefits of YARN:
- better capture of fine-grained local attention
- coarse-grained global focus
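The learned-bias / "dummy position" idea above amounts to appending one extra logit to the attention softmax and discarding its probability mass. A minimal sketch (function and variable names are mine):

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    # Append a learned per-head "sink" logit; the probability mass it
    # absorbs is simply dropped, so the real tokens can sum to less than 1.
    z = np.concatenate([scores, [sink_logit]])
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p[:-1]                        # drop the dummy/sink position

scores = np.array([0.1, 0.2, 0.0])
p = softmax_with_sink(scores, sink_logit=5.0)
print(p, p.sum())  # token probabilities now sum to well under 1.0
```

When the sink logit is large relative to the token scores, the head effectively attends to nothing, which is exactly the "pay no attention" behavior described above.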
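RoPE's two properties mentioned above, norm preservation and relative-position encoding, can be checked numerically. A NumPy sketch using the standard RoPE conventions, with a comment marking where a YaRN-style frequency rescaling would apply:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive dimension pairs by angles pos * base**(-2i/d).
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    # YaRN-style extension (sketch): rescale inv_freq per frequency band so
    # high-frequency dims don't wrap around beyond the trained context length.
    theta = pos * inv_freq
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[..., 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

q = np.random.default_rng(0).normal(size=8)
k = np.random.default_rng(1).normal(size=8)
# Rotation preserves the vector norm (helps training stability)
assert np.isclose(np.linalg.norm(rope(q, pos=7)), np.linalg.norm(q))
# The dot product depends only on the relative offset: positions (10, 13)
# give the same score as positions (0, 3)
a = rope(q, 10) @ rope(k, 13)
b = rope(q, 0) @ rope(k, 3)
assert np.isclose(a, b)
```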
Training and Post-Training Details
- Pre-training process: done on text-only data with the CBRN filter applied
- CBRN stands for "Chemical, Biological, Radiological, and Nuclear"
- Training time:
- 2 million GPU hours for 120B model
- roughly 10x fewer GPU hours for the 20B model
- Post-training process: reinforcement learning with the "chain of thought" format
- New Harmony chat format:
- newly introduced response format for gpt-oss models
- structures conversations with clear roles, channels, and special tokens; designed for better interoperability, reasoning separation, and tool integration
- variable reasoning effort levels: model’s ability to adjust amount of computational/cognitive effort it expends when solving different tasks or answering different questions
- allocate more resources (such as more layers, more steps, or more complex chains of thought) to harder problems
- implemented through mechanisms like:
- dynamic depth (of layers)
- adaptive computation
- explicit reasoning steps (e.g., chain-of-thought prompting)
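The Harmony roles, channels, and reasoning-effort levels above can be sketched as a raw token stream. The exact token strings and message contents here are illustrative assumptions based on the gpt-oss release, not a normative spec:

```python
# Sketch of a Harmony-format conversation: role after <|start|>, an optional
# channel, the body after <|message|>, terminated by <|end|>.
conversation = (
    "<|start|>system<|message|>You are a helpful assistant.\n"
    "Reasoning: high<|end|>"                    # variable reasoning effort level
    "<|start|>user<|message|>What is 2+2?<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>"
    "Trivial arithmetic; answer directly.<|end|>"  # chain of thought, kept separate
    "<|start|>assistant<|channel|>final<|message|>4<|end|>"
)
print(conversation)
```

Separating the `analysis` and `final` channels is what lets the format isolate the chain of thought from the user-visible answer.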