Ricky

Deep Dive: OpenAI's GPT-OSS

(These are my memos taken during the talk. Posted with permission.)

Overview

  • Deep dive into GPT-OSS, an open-weight model family from OpenAI with 120B and 20B variants.
  • Performance efficiency:
    • 120B model can run on a single 80GB VRAM GPU
    • 20B model can run locally on a laptop
  • Key architectural features
    • group query attention
    • sliding window attention
    • learned biases
  • Pre-trained on text-only data
  • Post-trained using reinforcement learning

Outline

Introduction and Overview of GPT OSS

  • GPT OSS: the first open-weight reasoning model from OpenAI
  • Two size variants: 120B and 20B
  • Technical highlights of the models:
    • Mixture of Experts (MoE) architecture
    • ability to run on a single 80GB VRAM GPU

Architecture and Normalization of GPT OSS

  • Architecture: standard SOTA LLM architecture built from transformer blocks with attention and MoE blocks.
  • MoE (Mixture of Experts) in each transformer layer: only a few experts are activated per token for speed, with the MoE weights quantized to 4.25 bits per parameter
  • Importance of normalization in deep neural networks, comparing Layer Norm and RMS Norm
    • Layer Norm: used in the original Transformer architecture
    • RMS Norm:
      • computationally cheaper (removing the mean subtraction step)
      • provides similar quality to Layer Norm
  • Placement of normalization in the transformer block, contrasting post-norm and pre-norm
    • Post-norm
      • applies normalization after the residual connection
      • stabilizes each layer’s output
      • can make training deep models more difficult, requiring careful learning rate warmup and risking vanishing gradients
    • Pre-norm
      • applies normalization before the attention and MoE blocks
      • more stable for training very deep models
      • less sensitive to learning rate schedules or warmup
      • preferred choice in modern large language models, GPT-OSS uses this
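The normalization comparison and the pre-/post-norm placement above can be sketched in a few lines of plain Python (toy vectors only; this is a minimal illustration, not a framework implementation):

```python
import math

def layer_norm(x, weight, bias, eps=1e-6):
    # LayerNorm: subtract the mean, then divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [w * (v - mean) / math.sqrt(var + eps) + b
            for w, v, b in zip(weight, x, bias)]

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: skip the mean subtraction and just divide by the
    # root-mean-square; cheaper, with similar quality in practice.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def post_norm_step(x, sublayer, norm):
    # Post-norm (original Transformer): normalize AFTER the residual add.
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer, norm):
    # Pre-norm (GPT-OSS and most modern LLMs): normalize BEFORE the
    # sublayer, leaving the residual path untouched.
    return [a + b for a, b in zip(x, sublayer(norm(x)))]
```

Note how `rms_norm` has strictly fewer operations than `layer_norm`, and how `pre_norm_step` keeps the raw residual stream intact, which is what makes training very deep stacks more forgiving.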
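The MoE routing mentioned for the transformer layer can be sketched as top-k gating: score all experts, run only the best k, and mix their outputs. This is a generic pure-Python sketch; the router shape, `top_k` value, and expert functions are illustrative assumptions, not GPT-OSS's actual configuration or kernels:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(x, router_w, experts, top_k=2):
    # Router: one score per expert (a linear layer over the token vector).
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_w]
    # Keep only the top_k experts and renormalize their gates.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])
    # Mix the chosen experts' outputs; the other experts never run,
    # which is why a large MoE is far cheaper to serve than a dense
    # model with the same total parameter count.
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        out = [o + g * v for o, v in zip(out, experts[i](x))]
    return out
```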

Algorithm Gems of the Attention Block

  • Group Query Attention (GQA): splits the query heads into groups and shares one set of key-value (KV) heads per group
    • Compared to Multi-Head Attention (MHA) from the original Transformer: reduces the number of KV parameters and the memory required for the KV cache
    • allows each group to share a set of KV heads -> significantly lower memory and compute requirements during inference
  • Sliding window attention: a local, sparse self-attention pattern that reduces computational complexity
    • allows large language models to efficiently process very long sequences of text
    • restricts each token to attend only to a fixed window of neighboring tokens
    • computation scales linearly with the length of the input
    • reduces memory and compute requirements
    • GPT OSS can handle extremely long contexts (up to 128k tokens) that would be impractical with standard attention mechanisms
  • Alternating between dense and sliding window layers in GPT OSS allows the model to focus on both local and global information
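Both attention tricks above can be combined in one toy sketch: query heads grouped over shared KV heads (GQA), with an optional sliding window that restricts each token to its recent neighbors. Pure Python with tiny dimensions; the head counts and window size are illustrative, not GPT-OSS's real configuration:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gqa_sliding_attention(queries, keys, values, n_kv_heads, window=None):
    # queries: [n_q_heads][seq][dim]; keys/values: [n_kv_heads][seq][dim].
    # GQA: each group of n_q_heads // n_kv_heads query heads reuses one
    # KV head, shrinking KV parameters and cache by the same factor.
    # Sliding window: each token attends only to its last `window`
    # positions, so compute grows linearly with sequence length;
    # window=None gives the dense (full causal) layer.
    n_q_heads = len(queries)
    group = n_q_heads // n_kv_heads
    out = []
    for h in range(n_q_heads):
        k, v = keys[h // group], values[h // group]  # shared KV head
        head_out = []
        for t, q in enumerate(queries[h]):
            lo = 0 if window is None else max(0, t - window + 1)
            span = range(lo, t + 1)                  # causal (+ windowed)
            probs = softmax([dot(q, k[s]) / math.sqrt(len(q)) for s in span])
            head_out.append([sum(p * v[s][d] for p, s in zip(probs, span))
                             for d in range(len(q))])
        out.append(head_out)
    return out
```

Alternating calls with `window=None` and `window=W` across layers mirrors the dense/sliding interleaving described above: local detail from the windowed layers, global context from the dense ones.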

Context Length Extension and Positional Encoding

  • A learned bias added to the attention block enables the model to pay no attention to any token
    • allocate some of its attention probability mass to a special "dummy" position, rather than being forced to distribute it among the actual tokens
    • it is useful for an attention head to be able to pay no attention, because this lets the model avoid focusing on irrelevant or unhelpful information
  • Importance of positional encoding: comparing sinusoidal position encoding and Rotary Position Embedding (RoPE) position encoding
    • Positional encoding helps model distinguish "dog bites man" from "man bites dog"
    • Sinusoidal position encoding:
      • adds a unique vector of sines and cosines to each token’s embedding, providing absolute position information
      • allows the model to learn about word order and distance
      • limited in how well it can generalize to longer sequences than it was trained on
    • RoPE (Rotary Position Embedding):
      • improves by encoding position information through a rotation matrix applied to the query and key vectors
      • captures relative position information more directly and preserves the norm of the vectors, helps with training stability
      • enables better extrapolation to longer sequences
      • de facto standard in modern large language models.
  • YaRN (Yet another RoPE extensioN) improves RoPE's performance at long contexts
    • scales the frequency dimensions used in the positional encoding
    • standard RoPE: high-frequency components can "wrap around" too quickly when the context length is much longer than what the model was trained on -> position information becomes ambiguous or unstable
    • YaRN adjusts (scales) the frequency dimensions so positional encoding remains meaningful and stable even for very long sequences
    • scaling prevents rapid repetition of high-frequency components -> allows the model to handle and reason over long contexts without losing track of position
    • frequency dimension: component of the encoding vector that oscillates at specific frequency
      • In sinusoidal and rotary positional encodings (like RoPE), each position in the sequence is mapped to vector where each element (or pair of elements) varies sinusoidally with a different frequency
      • Lower frequency dimensions change slowly across positions, capture broad positional trends
      • Higher frequency dimensions change rapidly, capture fine-grained positional differences
      • Combining multiple frequency dimensions so model can uniquely represent each position in sequence, encode both absolute and relative position information
  • Benefits of YARN:
    • better capture of fine-grained local attention
    • coarse-grained global focus
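The learned "pay no attention" bias from the start of this section can be sketched as a softmax with one extra dummy logit. This is my own minimal reading of the mechanism described in the talk, not OpenAI's implementation; the name `sink_logit` is an assumption:

```python
import math

def softmax_with_sink(scores, sink_logit):
    # Softmax over the real token scores PLUS one learned "sink" logit.
    # Probability mass assigned to the sink is simply discarded, so the
    # returned token probabilities can sum to less than 1: the head is
    # allowed to (partly or fully) attend to nothing.
    m = max(scores + [sink_logit])
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]
```

With a large `sink_logit` the head effectively switches itself off for that position, instead of being forced to smear attention over unhelpful tokens.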
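RoPE's rotation, and the idea of rescaling its frequency dimensions, can be sketched as follows. The `scale` argument is an illustrative stand-in for YaRN's actual per-band interpolation rule (which is more involved); everything else is the standard rotary formulation:

```python
import math

def rope_rotate(vec, pos, base=10000.0, scale=None):
    # Rotate consecutive (x, y) pairs of a query/key vector by an angle
    # proportional to the token position. Pair d uses frequency
    # base**(-2d/dim): early pairs spin fast (fine-grained position),
    # late pairs slowly (coarse position). Rotations preserve the norm,
    # and dot products between rotated q and k depend only on the
    # relative distance between their positions.
    dim = len(vec)
    out = list(vec)
    for d in range(dim // 2):
        freq = base ** (-2.0 * d / dim)
        if scale is not None:
            freq *= scale[d]  # YaRN-style frequency rescaling (assumed form)
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * d], vec[2 * d + 1]
        out[2 * d] = x * c - y * s
        out[2 * d + 1] = x * s + y * c
    return out
```

Shrinking the fast frequencies via `scale` slows their oscillation, which is the intuition behind keeping high-frequency pairs from "wrapping around" at context lengths beyond training.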

Training and Post-Training Details

  • Pre-training process: done on text-only data, with a CBRN safety filter applied
    • CBRN stands for "Chemical, Biological, Radiological, and Nuclear"
  • Training time:
    • 2 million GPU hours for the 120B model
    • roughly 10× fewer GPU hours for the 20B model
  • Post-training process: reinforcement learning with the chain-of-thought format
  • New Harmony chat format:
    • newly introduced response format for gpt-oss models
    • structures conversations with clear roles, channels, and special tokens, designed for better interoperability, reasoning separation, and tool integration
    • variable reasoning effort levels: the model’s ability to adjust the amount of computational effort it expends on different tasks or questions
      • allocate more resources (such as more layers, more steps, or more complex chains of thought) to harder problems
      • implemented through mechanisms like:
        • dynamic depth (of layers)
        • adaptive computation
        • explicit reasoning steps (e.g., chain-of-thought prompting)
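For reference, a Harmony-format conversation looks roughly like the sketch below (my reconstruction from the openai-harmony documentation; treat the exact token spellings as illustrative). The system message sets the reasoning effort, the `analysis` channel carries the chain of thought, and `final` carries the user-visible answer:

```
<|start|>system<|message|>You are a helpful assistant.
Reasoning: high<|end|>
<|start|>user<|message|>What is 2 + 2?<|end|>
<|start|>assistant<|channel|>analysis<|message|>Simple arithmetic: 2 + 2 = 4.<|end|>
<|start|>assistant<|channel|>final<|message|>4<|return|>
```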
