DEV Community: Vuk Rosić

DeepSeek DeepGEMM Explained

Vuk Rosić — Mon, 07 Jul 2025 18:42:19 +0000

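This repository is DeepGEMM, a library from the DeepSeek team focused on efficient FP8 (8-bit floating point) general matrix multiplication (GEMM). It supports fine-grained scaling as well as the grouped GEMMs used in Mixture-of-Experts (MoE) models.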

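Key features

FP8 GEMM support
- Optimized for the NVIDIA Hopper architecture (such as the H100 GPU) and built on Tensor Cores for high throughput.
- Because FP8 Tensor Core accumulation is imprecise, partial sums are promoted to CUDA cores for two-level accumulation, which preserves accuracy.
Grouped GEMM
- Supports a contiguous layout and a masked layout, covering MoE training and inference.
- The contiguous layout suits training and the inference prefill phase; the masked layout suits decoding (for example, under CUDA Graphs).
Weight gradient kernels
- Supports the backward pass of both dense and MoE models.
Lightweight JIT (just-in-time) compilation
- Nothing is compiled at install time; all kernels are compiled at runtime, which simplifies deployment.
- Supports NVCC and NVRTC (the NVIDIA runtime compiler); NVRTC compiles faster, by up to 10x.
High-performance optimizations
- Uses the Hopper TMA (Tensor Memory Accelerator) for asynchronous loads and stores.
- Persistent warp specialization overlaps data movement with computation.
- FFMA SASS instruction interleaving improves FP8 throughput.
- Unaligned block sizes improve SM (streaming multiprocessor) utilization.
Clean code design
- There is a single core GEMM kernel function, which keeps the code easy to understand and tune.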
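
The two-level accumulation idea can be illustrated without a GPU. The sketch below is a toy model, not DeepGEMM's kernel: round_to_bits, dot_narrow, and dot_promoted are hypothetical helpers, and the 10-bit accumulator width is an assumption chosen only to make the effect visible. It compares keeping a long sum in a narrow accumulator against periodically promoting short partial sums into a wide one.

```python
import math

def round_to_bits(x: float, bits: int) -> float:
    """Round x to `bits` significant binary digits (toy narrow accumulator)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def dot_narrow(a, b, bits=10):
    """Accumulate every product in the narrow accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to_bits(acc + x * y, bits)
    return acc

def dot_promoted(a, b, bits=10, chunk=128):
    """Accumulate `chunk` products narrowly, then add the partial sum to a wide accumulator."""
    total = 0.0  # a Python float (64-bit) stands in for the wide accumulator
    for start in range(0, len(a), chunk):
        partial = 0.0
        for x, y in zip(a[start:start + chunk], b[start:start + chunk]):
            partial = round_to_bits(partial + x * y, bits)
        total += partial
    return total

ones = [1.0] * 16384
narrow = dot_narrow(ones, ones)
promoted = dot_promoted(ones, ones)
print(f"exact: {len(ones)}, narrow: {narrow}, promoted: {promoted}")
```

In this toy setup the narrow accumulator stalls at 1024, because adding 1 to a 10-bit value of that size rounds back to the same number, while the promoted version recovers the exact sum of 16384.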

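Use cases

Training and inference of large models (Transformer and MoE architectures).
High-performance computing (HPC) workloads that need low-precision (FP8) matrix multiplication.
Tasks whose compute shapes change dynamically (for example, variable-length sequences).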

Quick start

Requirements

GPU: NVIDIA Hopper architecture (such as the H100; sm_90a support is required).
CUDA: 12.3+ (12.8+ recommended for best performance).
Python: 3.8+.
PyTorch: 2.1+.
CUTLASS: 3.6+ (pulled in as a Git submodule).

Installation

git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git
cd DeepGEMM
python setup.py develop  # development mode
python setup.py install  # install

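Usage example

import torch
import deep_gemm

# Plain FP8 GEMM. torch.randn cannot sample FP8 directly, so sample in BF16 and cast.
# Unit scale factors are placeholders; real code derives them from per-block amax values.
lhs_scales = torch.ones(4096, 56, device="cuda", dtype=torch.float32)
lhs = (torch.randn(4096, 7168, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn),
       deep_gemm.get_col_major_tma_aligned_tensor(lhs_scales))  # [M, K], [M, K/128]
rhs = (torch.randn(2112, 7168, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn),
       torch.ones(17, 56, device="cuda", dtype=torch.float32))  # [N, K], [ceil(N/128), K/128]
out = torch.empty(4096, 2112, dtype=torch.bfloat16, device="cuda")
deep_gemm.gemm_fp8_fp8_bf16_nt(lhs, rhs, out)

# MoE grouped GEMM (contiguous layout): the RHS gains a leading group dimension [G, N, K],
# and m_indices assigns each LHS row to a group (rows of one group are adjacent)
num_groups = 4
rhs_grouped = tuple(t.unsqueeze(0).expand(num_groups, -1, -1).contiguous() for t in rhs)
m_indices = torch.arange(num_groups, device="cuda", dtype=torch.int32).repeat_interleave(4096 // num_groups)
deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(lhs, rhs_grouped, out, m_indices)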

Performance

Reaches up to 1550 TFLOPS of FP8 throughput on an H800 GPU.
Matches or outperforms expert-tuned libraries such as CUTLASS.

Roadmap

BF16 support.
Larger block sizes (extending the N dimension to 256).
Better power efficiency.
Support for CUDA PDL (Programmatic Dependent Launch).

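Summary

DeepGEMM is an efficient, easy-to-use FP8 GEMM library that is especially well suited to speeding up large models and MoE architectures. Its JIT compilation design and fine-grained optimizations keep performance high across a wide range of matrix shapes.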

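Citation

To cite DeepGEMM, use the BibTeX entry in the repository.

In the training and inference of large language models (LLMs), DeepGEMM can take over the key matrix multiplications (GEMMs) listed below and speed them up considerably, especially for low-precision FP8 compute and Mixture-of-Experts (MoE) models: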

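1. Feed-forward network (FFN) layers

Standard FFN:

The fully connected layers of an LLM (such as W_in @ X and W_out @ GELU(W_gate @ X)) typically account for 30% to 50% of the compute.

Replacement:

Use gemm_fp8_fp8_bf16_nt in place of an FP16/BF16 torch.matmul. FP8 speeds up the multiply, and two-level accumulation preserves accuracy.

MoE expert layers:

In an MoE model, each token is routed to a subset of the experts (Expert_i @ X), so the computation is sparse.

Replacement:

Use a grouped GEMM (m_grouped_gemm_fp8_fp8_bf16_nt_contiguous or m_grouped_gemm_fp8_fp8_bf16_nt_masked) to merge the experts' computations into one batched call and cut launch overhead.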

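2. Attention

QKV projections:

Q = X @ W_Q, K = X @ W_K, and V = X @ W_V can be accelerated with gemm_fp8_fp8_bf16_nt.

Note: Softmax still has to run in FP16/BF16, because FP8 is not precise enough.

Attention scores:

S = Q @ K^T can in principle use an FP8 GEMM, but it needs scaling factors because FP8 has a narrow dynamic range. DeepGEMM's fine-grained scaling may be a fit here.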

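3. Gradient computation in the backward pass

Weight gradients:

The backward pass computes dW = X^T @ dY (for example, W.grad of an FFN layer).

Replacement:

Use wgrad_gemm_fp8_fp8_fp32_nt, which takes FP8 inputs and accumulates the gradient directly in FP32, with no explicit type conversion.

Expert gradient aggregation in MoE:

Use k_grouped_wgrad_gemm_fp8_fp8_fp32_nt to process the gradients of all experts together rather than one expert at a time.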

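4. Other scenarios

Output projection through the embedding matrix: logits = hidden_states @ embedding_matrix.T can be accelerated with an FP8 GEMM (dimensions must be aligned).
Low-rank multiplication in LoRA adapters: the low-rank products of LoRA (A @ B) are candidates for FP8, although a small rank falls below the 128-element K alignment and would need padding.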

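Requirements and caveats

Hardware: an NVIDIA Hopper GPU (such as the H100) with FP8 Tensor Cores.
Accuracy trade-off: FP8 can affect convergence, so it is best used late in training or for inference.
Shape alignment: the K dimension of the inputs must be a multiple of 128 (DeepGEMM uses BLOCK_K=128); otherwise, pad it.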
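
The alignment rule and the scale-factor shapes it implies can be worked out with a few lines of integer arithmetic. This is a minimal sketch: ceil_div, padded_k, and scale_shapes are hypothetical helper names, and the block size of 128 comes from the BLOCK_K value mentioned above. It assumes one scale per row per 128 elements for the LHS and one scale per 128x128 block for the RHS.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b

def padded_k(k: int, block_k: int = 128) -> int:
    """Smallest multiple of block_k that is >= k."""
    return ceil_div(k, block_k) * block_k

def scale_shapes(m: int, n: int, k: int, block: int = 128):
    """Scale-factor shapes for 1x128 LHS blocks and 128x128 RHS blocks."""
    k_blocks = ceil_div(k, block)
    lhs_scales = (m, k_blocks)
    rhs_scales = (ceil_div(n, block), k_blocks)
    return lhs_scales, rhs_scales

print(padded_k(7168))                    # already a multiple of 128
print(padded_k(7000))                    # must be padded up
print(scale_shapes(4096, 2112, 7168))    # shapes for M=4096, N=2112, K=7168
```

For M=4096, N=2112, K=7168 this gives LHS scales of shape (4096, 56) and RHS scales of shape (17, 56); the 17 appears because 2112 is not a multiple of 128 and the last block is partial.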

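Performance gains

Speed: on Hopper, the theoretical peak FP8 throughput is 2x that of BF16; the end-to-end speedup is lower, because only the GEMMs get faster and quantization adds overhead.
Memory: FP8 data takes half the memory of BF16 (1 byte per element instead of 2), leaving room for larger batch sizes.
MoE: grouped GEMM reduces the number of kernel launches and raises throughput.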

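Example code (replacing a linear layer in a HuggingFace model)

import torch
from transformers import AutoModel
import deep_gemm

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf").cuda()

# Sketch of an FP8 replacement for a linear layer's matmul
def fp8_linear(x, weight, scales_x, scales_w):
    # x: [seq_len, hidden_dim]; weight: [out_dim, hidden_dim] (nn.Linear layout, i.e. [N, K])
    # Pseudocode: a bare .to() cast is not real quantization; divide by the scales first
    x_fp8 = (x.to(torch.float8_e4m3fn), scales_x)
    w_fp8 = (weight.to(torch.float8_e4m3fn), scales_w)
    out = torch.empty(x.shape[0], weight.shape[0], dtype=torch.bfloat16, device=x.device)
    deep_gemm.gemm_fp8_fp8_bf16_nt(x_fp8, w_fp8, out)
    return out

# Target layer: Llama's MLP exposes gate_proj, up_proj, and down_proj
# (there is no dense_h_to_4h). Wiring fp8_linear in requires real scale
# factors for the activations and the weight, which this sketch omits.
proj = model.layers[0].mlp.up_proj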

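Notes

Accuracy calibration: FP8 needs dynamic scaling factors. DeepGEMM consumes them but does not compute them; the caller quantizes the inputs, typically from amax statistics collected during training.
Kernel choice: small matrices (such as M < 128) may not fully use the Tensor Cores, so benchmark them.

With the right substitutions, DeepGEMM can noticeably speed up training iterations and reduce inference latency for LLMs, especially for MoE and very large models.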

Code a Neural Network from Scratch in NumPy

Vuk Rosić — Mon, 07 Jul 2025 09:20:09 +0000

Part of my Zero to AI Researcher / Engineer Course

Part 1: Getting Started - Your First Neural Network

import numpy as np
import matplotlib.pyplot as plt

# Create simple dataset - XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

print(f"Input shape: {X.shape}")    # (4, 2)
print(f"Output shape: {y.shape}")   # (4, 1)
print(f"Dataset:")
for i in range(len(X)):
    print(f"  {X[i]} -> {y[i][0]}")  # Print each input pair and its expected output

What happened: We created the XOR dataset, a classic problem that is not linearly separable: no single straight line splits the two classes, so the model needs at least one hidden layer.
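
To make "not linearly separable" concrete, here is a small sketch that searches a grid of weights for a single linear threshold unit. find_linear_separator and the grid range are illustrative choices, not part of the tutorial, and a grid search is not a proof; it simply agrees with the known result that AND has a linear separator and XOR does not.

```python
from itertools import product

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def find_linear_separator(labels, grid):
    """Return (w1, w2, b) with (w1*x1 + w2*x2 + b > 0) matching every label, or None."""
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(t)
               for (x1, x2), t in zip(POINTS, labels)):
            return w1, w2, b
    return None

grid = [i / 2 for i in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5
and_separator = find_linear_separator([0, 0, 0, 1], grid)
xor_separator = find_linear_separator([0, 1, 1, 0], grid)
print(f"AND separator: {and_separator}")
print(f"XOR separator: {xor_separator}")
```

The search finds weights for AND but returns None for XOR, which is why the network below adds a hidden layer.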

Part 2: Understanding the Math - Forward Pass

Basic Network Architecture

# Network architecture: 2 -> 4 -> 1 (input -> hidden -> output)
input_size = 2      # Number of input features (the two binary inputs)
hidden_size = 4     # Number of neurons in hidden layer
output_size = 1     # Number of output values (0 or 1)

# Initialize weights and biases
np.random.seed(42)
W1 = np.random.randn(input_size, hidden_size) * 0.5  # Weights: connection strengths between layers
b1 = np.zeros((1, hidden_size))                      # Biases: adjustable offsets for each neuron
W2 = np.random.randn(hidden_size, output_size) * 0.5 # Weights for output layer
b2 = np.zeros((1, output_size))                      # Bias for output layer

print(f"W1 shape: {W1.shape}")  # (2, 4)
print(f"b1 shape: {b1.shape}")  # (1, 4)
print(f"W2 shape: {W2.shape}")  # (4, 1)
print(f"b2 shape: {b2.shape}")  # (1, 1)

OPTIONAL: Understanding Matrix Shapes in Neural Networks

# Let's trace through the shapes step by step
print("Shape flow through network:")
print(f"Input X:        {X.shape}")           # (4, 2)
print(f"Weights W1:     {W1.shape}")          # (2, 4)
print(f"X @ W1:         {(X @ W1).shape}")    # (4, 4)
print(f"Bias b1:        {b1.shape}")          # (1, 4)

# Broadcasting explanation
sample_mult = X @ W1
print(f"\nBroadcasting bias:")
print(f"(X @ W1) shape: {sample_mult.shape}")  # (4, 4)
print(f"b1 shape:       {b1.shape}")           # (1, 4)
print(f"Result shape:   {(sample_mult + b1).shape}")  # (4, 4)

Forward Pass Implementation

def sigmoid(x):
    """Sigmoid activation (examined in detail in the next section)"""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def forward_pass(X, W1, b1, W2, b2):
    """Complete forward pass through the network"""
    # Hidden layer
    z1 = X @ W1 + b1              # Linear transformation
    a1 = sigmoid(z1)              # Activation

    # Output layer
    z2 = a1 @ W2 + b2            # Linear transformation
    a2 = sigmoid(z2)             # Activation

    return z1, a1, z2, a2

# Test forward pass
z1, a1, z2, predictions = forward_pass(X, W1, b1, W2, b2)
print(f"Predictions shape: {predictions.shape}")
print(f"Predictions:\n{predictions.flatten()}")
print(f"Actual labels:\n{y.flatten()}")

OPTIONAL: Understanding the Sigmoid Function

def sigmoid(x):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))  # Clip to prevent overflow

# Test sigmoid on different inputs
test_values = np.array([-10, -1, 0, 1, 10])
sigmoid_values = sigmoid(test_values)

print("Sigmoid function behavior:")
for i, val in enumerate(test_values):
    print(f"  sigmoid({val:3.0f}) = {sigmoid_values[i]:.3f}")

# Visualize sigmoid
x_range = np.linspace(-10, 10, 100)
y_sigmoid = sigmoid(x_range)
plt.figure(figsize=(8, 4))
plt.plot(x_range, y_sigmoid, 'b-', linewidth=2)
plt.title('Sigmoid Activation Function')
plt.xlabel('x')
plt.ylabel('sigmoid(x)')
plt.grid(True, alpha=0.3)
plt.show()

OPTIONAL: Breaking Down the Sigmoid Formula

# Let's understand sigmoid step by step: 1 / (1 + e^(-x))
x = 2.0
print(f"Input: x = {x}")
print(f"Step 1: -x = {-x}")
print(f"Step 2: e^(-x) = np.exp(-x) = {np.exp(-x):.3f}")
print(f"Step 3: 1 + e^(-x) = {1 + np.exp(-x):.3f}")
print(f"Step 4: 1 / (1 + e^(-x)) = {1 / (1 + np.exp(-x)):.3f}")
print(f"Sigmoid result: {sigmoid(x):.3f}")

# Why clipping?
print(f"\nWhy we clip large values:")
large_neg_x = -1000
print(f"Without clipping: sigmoid({large_neg_x}) evaluates np.exp({-large_neg_x}), which overflows to inf")
print(f"With clipping: np.exp(500) = {np.exp(500):.2e} (huge but finite, so sigmoid safely returns ~0)")
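
Clipping is one way to avoid the overflow. A common alternative, shown here as a scalar sketch rather than what this tutorial uses, is to pick the formula based on the sign of x so that the exponential is never taken of a large positive number. stable_sigmoid is a hypothetical name, and the example uses the standard-library math module instead of NumPy.

```python
import math

def stable_sigmoid(x: float) -> float:
    """Sigmoid that only ever exponentiates a non-positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))   # exp(-x) <= 1 here
    z = math.exp(x)                          # exp(x) < 1 here
    return z / (1.0 + z)

for value in [-1000.0, -1.0, 0.0, 1.0, 1000.0]:
    print(f"stable_sigmoid({value:7.1f}) = {stable_sigmoid(value):.6f}")
```

Both branches compute the same function; the second is the first with numerator and denominator multiplied by e^x.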

Key insight: The forward pass transforms input through weighted connections and activations to produce predictions.

Part 3: Computing Loss and Gradients

Loss Function

def compute_loss(predictions, targets):
    """Mean squared error loss"""
    return np.mean((predictions - targets) ** 2)  # Square the difference to penalize large errors

OPTIONAL: Why We Square the Difference

# Let's see why we use (predictions - targets) ** 2
pred = np.array([0.8, 0.2, 0.9, 0.1])
target = np.array([1.0, 0.0, 1.0, 0.0])

differences = pred - target
squared_differences = differences ** 2

print("Understanding squared error:")
print(f"Predictions: {pred}")
print(f"Targets:     {target}")
print(f"Differences: {differences}")
print(f"Squared:     {squared_differences}")
print(f"Mean squared error: {np.mean(squared_differences):.4f}")

# Why not just absolute difference?
abs_differences = np.abs(differences)
print(f"\nComparison:")
print(f"Absolute differences: {abs_differences}")
print(f"Squared differences:  {squared_differences}")
print("Squared errors penalize large mistakes more heavily!")

Calculate initial loss

initial_loss = compute_loss(predictions, y)
print(f"Initial loss: {initial_loss:.4f}")

OPTIONAL: Understanding Derivative of Sigmoid

def sigmoid_derivative(x):
    """Derivative of sigmoid function"""
    s = sigmoid(x)
    return s * (1 - s)  # Derivative formula: sigmoid(x) * (1 - sigmoid(x))

# Test derivative
test_vals = np.array([-2, -1, 0, 1, 2])
sigmoid_vals = sigmoid(test_vals)
derivative_vals = sigmoid_derivative(test_vals)

print("Sigmoid and its derivative:")
for i, val in enumerate(test_vals):
    print(f"  x={val:2.0f}: sigmoid={sigmoid_vals[i]:.3f}, derivative={derivative_vals[i]:.3f}")

# Visualize both functions
x_range = np.linspace(-6, 6, 100)
y_sigmoid = sigmoid(x_range)
y_derivative = sigmoid_derivative(x_range)

plt.figure(figsize=(10, 4))
plt.plot(x_range, y_sigmoid, 'b-', label='sigmoid(x)', linewidth=2)
plt.plot(x_range, y_derivative, 'r--', label="sigmoid'(x)", linewidth=2)
plt.title('Sigmoid Function and Its Derivative')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
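
A quick way to gain confidence in a hand-derived derivative is to compare it with a finite-difference estimate. The sketch below does this for the sigmoid using scalar math functions; central_difference and the step size h are illustrative choices, not part of the tutorial's code.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Numerical slope of f at x from two nearby evaluations."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

max_gap = 0.0
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    analytic = sigmoid_derivative(x)
    numeric = central_difference(sigmoid, x)
    max_gap = max(max_gap, abs(analytic - numeric))
    print(f"x={x:+.1f}: analytic={analytic:.6f}, numeric={numeric:.6f}")
```

The two columns should agree to many decimal places; a visible mismatch would point to an error in the formula.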

OPTIONAL: Chain Rule in Backpropagation

# The chain rule: if y = f(g(x)), then dy/dx = f'(g(x)) * g'(x)
#
# For our network: Loss = MSE(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), y)
# We need: dLoss/dW2, dLoss/db2, dLoss/dW1, dLoss/db1

print("Chain rule breakdown:")
print("dLoss/dW2 = dLoss/da2 * da2/dz2 * dz2/dW2")
print("  where:")
print("    dLoss/da2 = 2 * (predictions - targets)  # MSE derivative (the 1/m of the mean is applied in dW2)")
print("    da2/dz2 = sigmoid'(z2)                   # sigmoid derivative")
print("    dz2/dW2 = a1                             # linear layer derivative")

Backpropagation Implementation

def backward_pass(X, y, z1, a1, z2, a2, W1, b1, W2, b2):
    """Compute gradients using backpropagation"""
    m = X.shape[0]  # Number of samples

    # Output layer gradients
    dz2 = 2 * (a2 - y) * sigmoid_derivative(z2)  # (4, 1)
    dW2 = a1.T @ dz2 / m                         # (4, 1)
    db2 = np.mean(dz2, axis=0, keepdims=True)    # (1, 1)

    # Hidden layer gradients
    dz1 = (dz2 @ W2.T) * sigmoid_derivative(z1)  # (4, 4)
    dW1 = X.T @ dz1 / m                          # (2, 4)
    db1 = np.mean(dz1, axis=0, keepdims=True)    # (1, 4)

    return dW1, db1, dW2, db2

# Test backpropagation
dW1, db1, dW2, db2 = backward_pass(X, y, z1, a1, z2, predictions, W1, b1, W2, b2)
print(f"Gradient shapes:")
print(f"  dW1: {dW1.shape}, dW2: {dW2.shape}")
print(f"  db1: {db1.shape}, db2: {db2.shape}")

OPTIONAL: Understanding Output Layer Gradients

# Let's break down the output layer gradient calculation
print("Output layer gradient breakdown:")
print("dz2 = 2 * (a2 - y) * sigmoid_derivative(z2)")

# Step by step
error = predictions - y  # How far off our predictions are (predictions holds a2 from the forward pass)
print(f"Error (a2 - y) shape: {error.shape}")
print(f"Error values:\n{error.flatten()}")

mse_gradient = 2 * error  # Derivative of MSE
print(f"\nMSE gradient (2 * error) shape: {mse_gradient.shape}")

sigmoid_grad = sigmoid_derivative(z2)  # Derivative of sigmoid
print(f"Sigmoid gradient shape: {sigmoid_grad.shape}")

dz2_step = mse_gradient * sigmoid_grad  # Chain rule
print(f"Combined gradient (dz2) shape: {dz2_step.shape}")

OPTIONAL: Understanding Hidden Layer Gradients

# Hidden layer gradients are more complex due to chain rule
print("Hidden layer gradient breakdown:")
print("dz1 = (dz2 @ W2.T) * sigmoid_derivative(z1)")

# Step by step
dz2 = 2 * (predictions - y) * sigmoid_derivative(z2)  # Recompute: dz2 is local to backward_pass
error_propagated = dz2 @ W2.T  # Propagate error backwards
print(f"Error propagated shape: {error_propagated.shape}")
print(f"This spreads output error to each hidden neuron")

hidden_sigmoid_grad = sigmoid_derivative(z1)  # Local gradient
print(f"Hidden sigmoid gradient shape: {hidden_sigmoid_grad.shape}")

dz1_step = error_propagated * hidden_sigmoid_grad  # Final gradient
print(f"Combined hidden gradient (dz1) shape: {dz1_step.shape}")

OPTIONAL: Understanding Weight Gradients

# Weight gradients show how to adjust connections
print("Weight gradient calculation:")
print("dW2 = a1.T @ dz2 / m")

print(f"a1.T shape: {a1.T.shape}")  # Transposed hidden activations
print(f"dz2 shape: {dz2.shape}")    # Output gradients
print(f"dW2 shape: {(a1.T @ dz2).shape}")  # Weight gradients

# This gives us the gradient for each weight connection
print(f"\nWeight gradients tell us:")
print(f"- Positive gradient: decrease this weight")
print(f"- Negative gradient: increase this weight")
print(f"- Large gradient: this weight has big impact on error")

Critical concept: Backpropagation uses the chain rule to compute how much each weight contributes to the total error.
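
Backpropagation code is easy to get subtly wrong, so it is worth checking against finite differences. The sketch below is a dependency-free miniature, not the tutorial's NumPy code: it uses a smaller 2 -> 2 -> 1 sigmoid network with the same MSE loss on XOR, and the parameter layout, starting values, and function names are all assumptions made for illustration.

```python
import math

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y = [0.0, 1.0, 1.0, 0.0]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(p, x1, x2):
    """2 -> 2 -> 1 sigmoid network; p = [w11, w12, w21, w22, b1, b2, v1, v2, c]."""
    h1 = sigmoid(x1 * p[0] + x2 * p[2] + p[4])
    h2 = sigmoid(x1 * p[1] + x2 * p[3] + p[5])
    out = sigmoid(h1 * p[6] + h2 * p[7] + p[8])
    return h1, h2, out

def loss(p):
    """Mean squared error over the four XOR samples."""
    return sum((forward(p, x1, x2)[2] - t) ** 2 for (x1, x2), t in zip(X, Y)) / len(X)

def backprop(p):
    """Analytic gradient of the loss with respect to every parameter (chain rule)."""
    g = [0.0] * 9
    for (x1, x2), t in zip(X, Y):
        h1, h2, out = forward(p, x1, x2)
        dz2 = 2.0 * (out - t) * out * (1.0 - out) / len(X)  # output pre-activation
        g[6] += dz2 * h1
        g[7] += dz2 * h2
        g[8] += dz2
        dz1_1 = dz2 * p[6] * h1 * (1.0 - h1)  # hidden unit 1 pre-activation
        dz1_2 = dz2 * p[7] * h2 * (1.0 - h2)  # hidden unit 2 pre-activation
        g[0] += dz1_1 * x1
        g[2] += dz1_1 * x2
        g[4] += dz1_1
        g[1] += dz1_2 * x1
        g[3] += dz1_2 * x2
        g[5] += dz1_2
    return g

def numeric_grad(p, h=1e-6):
    """Central-difference estimate of the same gradient."""
    g = []
    for i in range(len(p)):
        up = p[:i] + [p[i] + h] + p[i + 1:]
        down = p[:i] + [p[i] - h] + p[i + 1:]
        g.append((loss(up) - loss(down)) / (2.0 * h))
    return g

params = [0.3, -0.2, 0.5, 0.1, 0.0, 0.1, 0.7, -0.4, 0.05]
max_gap = max(abs(a - n) for a, n in zip(backprop(params), numeric_grad(params)))
print(f"Largest gap between backprop and finite differences: {max_gap:.2e}")
```

The same check can be applied to backward_pass above by perturbing one entry of W1 or W2 at a time and recomputing the loss.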

Part 4: Training the Network

Training Loop

def train_network(X, y, epochs=1000, learning_rate=1.0):
    """Train the neural network"""
    # Initialize weights
    np.random.seed(42)
    W1 = np.random.randn(2, 4) * 0.5
    b1 = np.zeros((1, 4))
    W2 = np.random.randn(4, 1) * 0.5
    b2 = np.zeros((1, 1))

    losses = []

    for epoch in range(epochs):
        # Forward pass
        z1, a1, z2, predictions = forward_pass(X, W1, b1, W2, b2)

        # Compute loss
        loss = compute_loss(predictions, y)
        losses.append(loss)

        # Backward pass
        dW1, db1, dW2, db2 = backward_pass(X, y, z1, a1, z2, predictions, W1, b1, W2, b2)

        # Update weights
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2

        # Print progress
        if epoch % 100 == 0:
            print(f"Epoch {epoch:4d}: Loss = {loss:.6f}")

    return W1, b1, W2, b2, losses

# Train the network
W1_trained, b1_trained, W2_trained, b2_trained, loss_history = train_network(X, y)

OPTIONAL: Understanding Learning Rate

# Learning rate controls how big steps we take during optimization
# Too small: slow convergence, too large: might overshoot minimum

learning_rates = [0.1, 1.0, 10.0]
plt.figure(figsize=(12, 4))

for i, lr in enumerate(learning_rates):
    plt.subplot(1, 3, i+1)

    # Train with this learning rate
    _, _, _, _, losses = train_network(X, y, epochs=500, learning_rate=lr)

    plt.plot(losses)
    plt.title(f'Learning Rate = {lr}')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.yscale('log')
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
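
The effect of the learning rate is easiest to see on a one-dimensional problem. This sketch minimizes f(w) = w**2 instead of the network loss; the function, the starting point, and the learning rates are illustrative choices.

```python
def gradient_descent(lr: float, steps: int = 20, w: float = 1.0) -> float:
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

for lr in [0.01, 0.1, 0.5, 1.1]:
    print(f"lr={lr}: final w = {gradient_descent(lr):.4g}")
```

Each step multiplies w by (1 - 2*lr), so a small rate shrinks w slowly, lr=0.5 lands on the minimum in one step, and lr=1.1 makes the magnitude of w grow every step.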

OPTIONAL: Visualizing Training Progress

# Plot loss curve
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(loss_history)
plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.yscale('log')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(loss_history[-100:])  # Last 100 epochs
plt.title('Training Loss (Last 100 Epochs)')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final loss: {loss_history[-1]:.6f}")

Part 5: Testing the Trained Network

Final Predictions

# Test the trained network
z1_final, a1_final, z2_final, final_predictions = forward_pass(X, W1_trained, b1_trained, W2_trained, b2_trained)

print("Final Results:")
print("Input -> Target | Prediction | Rounded")
print("-" * 40)
for i in range(len(X)):
    pred = final_predictions[i, 0]
    rounded = round(pred)
    target = y[i, 0]
    print(f"{X[i]} -> {target}      | {pred:.4f}    | {rounded}")

# Calculate accuracy
rounded_predictions = np.round(final_predictions)
accuracy = np.mean(rounded_predictions == y)
print(f"\nAccuracy: {accuracy:.1%}")

OPTIONAL: Visualizing Decision Boundary

# Create a grid of points to visualize the decision boundary
def plot_decision_boundary(W1, b1, W2, b2):
    """Plot the decision boundary learned by the network"""
    # Create a grid
    xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 100),
                         np.linspace(-0.5, 1.5, 100))

    # Flatten the grid for prediction
    grid_points = np.c_[xx.ravel(), yy.ravel()]

    # Make predictions on the grid
    _, _, _, grid_predictions = forward_pass(grid_points, W1, b1, W2, b2)
    grid_predictions = grid_predictions.reshape(xx.shape)

    # Plot
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, grid_predictions, levels=50, alpha=0.8, cmap='RdYlBu')
    plt.colorbar(label='Network Output')

    # Plot data points
    colors = ['red' if label == 0 else 'blue' for label in y.flatten()]
    plt.scatter(X[:, 0], X[:, 1], c=colors, s=100, edgecolors='black', linewidth=2)

    # Add labels
    for i, (x, y_val) in enumerate(zip(X, y.flatten())):
        plt.annotate(f'({x[0]},{x[1]})→{y_val}', 
                    (x[0], x[1]), xytext=(5, 5), textcoords='offset points')

    plt.title('Neural Network Decision Boundary')
    plt.xlabel('Input 1')
    plt.ylabel('Input 2')
    plt.grid(True, alpha=0.3)
    plt.show()

# Visualize the decision boundary
plot_decision_boundary(W1_trained, b1_trained, W2_trained, b2_trained)

Part 6: Understanding What We Built

Complete Neural Network Class

class SimpleNeuralNetwork:
    """A simple 2-layer neural network implementation"""

    def __init__(self, input_size=2, hidden_size=4, output_size=1):
        # Initialize weights
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]

        # Output layer gradients
        dz2 = 2 * (self.a2 - y) * self.a2 * (1 - self.a2)  # a2 already holds sigmoid(z2)
        dW2 = self.a1.T @ dz2 / m
        db2 = np.mean(dz2, axis=0, keepdims=True)

        # Hidden layer gradients
        dz1 = (dz2 @ self.W2.T) * self.a1 * (1 - self.a1)  # a1 already holds sigmoid(z1)
        dW1 = X.T @ dz1 / m
        db1 = np.mean(dz1, axis=0, keepdims=True)

        return dW1, db1, dW2, db2

    def train(self, X, y, epochs=1000, learning_rate=1.0):
        losses = []

        for epoch in range(epochs):
            # Forward pass
            predictions = self.forward(X)

            # Compute loss
            loss = np.mean((predictions - y) ** 2)
            losses.append(loss)

            # Backward pass
            dW1, db1, dW2, db2 = self.backward(X, y)

            # Update weights
            self.W1 -= learning_rate * dW1
            self.b1 -= learning_rate * db1
            self.W2 -= learning_rate * dW2
            self.b2 -= learning_rate * db2

            if epoch % 100 == 0:
                print(f"Epoch {epoch:4d}: Loss = {loss:.6f}")

        return losses

    def predict(self, X):
        return self.forward(X)

# Test the class
nn = SimpleNeuralNetwork()
losses = nn.train(X, y, epochs=1000, learning_rate=1.0)
predictions = nn.predict(X)

print("\nClass-based Neural Network Results:")
for i in range(len(X)):
    pred = predictions[i, 0]
    target = y[i, 0]
    print(f"{X[i]} -> {target} | Prediction: {pred:.4f} | Rounded: {round(pred)}")

OPTIONAL: Comparing with Different Architectures

# Test different hidden layer sizes
hidden_sizes = [2, 4, 8, 16]
results = {}

for hidden_size in hidden_sizes:
    print(f"\nTesting hidden size: {hidden_size}")
    nn = SimpleNeuralNetwork(input_size=2, hidden_size=hidden_size, output_size=1)
    losses = nn.train(X, y, epochs=1000, learning_rate=1.0)
    predictions = nn.predict(X)

    # Calculate accuracy
    accuracy = np.mean(np.round(predictions) == y)
    results[hidden_size] = {
        'final_loss': losses[-1],
        'accuracy': accuracy,
        'predictions': predictions
    }

    print(f"  Final loss: {losses[-1]:.6f}")
    print(f"  Accuracy: {accuracy:.1%}")

# Plot comparison
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
hidden_sizes_list = list(results.keys())
final_losses = [results[hs]['final_loss'] for hs in hidden_sizes_list]
plt.bar(hidden_sizes_list, final_losses)
plt.title('Final Loss vs Hidden Size')
plt.xlabel('Hidden Layer Size')
plt.ylabel('Final Loss')
plt.yscale('log')

plt.subplot(1, 2, 2)
accuracies = [results[hs]['accuracy'] for hs in hidden_sizes_list]
plt.bar(hidden_sizes_list, accuracies)
plt.title('Accuracy vs Hidden Size')
plt.xlabel('Hidden Layer Size')
plt.ylabel('Accuracy')
plt.ylim(0, 1)

plt.tight_layout()
plt.show()

Key takeaway: You've built a complete neural network from scratch! The network learns to solve the XOR problem by discovering the right weights and biases through gradient descent.

Summary

You've successfully implemented:

Forward propagation: Computing predictions from inputs
Loss computation: Measuring how wrong the predictions are
Backpropagation: Computing gradients using the chain rule
Weight updates: Using gradient descent to improve the network
Complete training loop: Putting it all together

The neural network learns by repeatedly adjusting its weights based on the errors it makes, eventually discovering the complex decision boundary needed to solve the XOR problem.

DeepGEMM Essentials: High-Performance FP8 Matrix Multiplication

Vuk Rosić — Mon, 07 Jul 2025 09:19:46 +0000



Master these concepts and you'll be able to leverage cutting-edge FP8 acceleration on Hopper H100, H200 & H800 GPUs!

Part 1: Getting Started - Your First FP8 GEMM

import torch
import deep_gemm

# Create simple input matrices
m, n, k = 128, 256, 512
lhs = torch.randn((m, k), device='cuda', dtype=torch.bfloat16)
rhs = torch.randn((n, k), device='cuda', dtype=torch.bfloat16)
output = torch.empty((m, n), device='cuda', dtype=torch.bfloat16)

print(f"LHS shape: {lhs.shape}")    # [128, 512]
print(f"RHS shape: {rhs.shape}")    # [256, 512] 
print(f"Output shape: {output.shape}")  # [128, 256]

What happened: We created the basic tensors for matrix multiplication: LHS × RHS^T = Output.

Part 2: Understanding FP8 - Why It Matters

# .numel() returns the total number of elements in a tensor
small_tensor = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(f"Small tensor shape: {small_tensor.shape}")
print(f"Small tensor elements: {small_tensor.numel()}")  # 2 * 3 = 6

matrix_2d = torch.randn(50, 20)
print(f"2D matrix elements: {matrix_2d.numel()}")  # 50 * 20 = 1000

# Our actual tensors
print(f"LHS elements: {lhs.numel()}")  # 128 * 512 = 65,536
print(f"RHS elements: {rhs.numel()}")  # 256 * 512 = 131,072

# Regular BF16 GEMM (what you normally do)
reference = lhs @ rhs.t()  # Standard PyTorch GEMM

# Check memory usage
bf16_memory = lhs.numel() * 2 + rhs.numel() * 2  # 2 bytes per BF16
fp8_memory = lhs.numel() * 1 + rhs.numel() * 1   # 1 byte per FP8

print(f"BF16 memory: {bf16_memory / 1024**2:.1f} MB")
print(f"FP8 memory: {fp8_memory / 1024**2:.1f} MB")
print(f"Memory saved: {(1 - fp8_memory/bf16_memory)*100:.1f}%")

Key insight: FP8 halves the memory of the matrix data relative to BF16 (the scale factors add a small overhead) while maintaining good accuracy with proper scaling.
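
The memory estimate above counts only the FP8 payload. As a rough sketch, the following adds the FP32 scale factors, assuming one scale per row per 128 elements for the LHS and one scale per 128x128 block for the RHS. The helper names are hypothetical and any alignment padding is ignored.

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

def bf16_bytes(rows: int, cols: int) -> int:
    return rows * cols * 2  # 2 bytes per BF16 element

def fp8_bytes_per_token(rows: int, cols: int, block: int = 128) -> int:
    """FP8 data plus one FP32 scale per row per 128-element block (LHS style)."""
    return rows * cols + 4 * rows * ceil_div(cols, block)

def fp8_bytes_per_block(rows: int, cols: int, block: int = 128) -> int:
    """FP8 data plus one FP32 scale per 128x128 block (RHS style)."""
    return rows * cols + 4 * ceil_div(rows, block) * ceil_div(cols, block)

m, n, k = 128, 256, 512
bf16_total = bf16_bytes(m, k) + bf16_bytes(n, k)
fp8_total = fp8_bytes_per_token(m, k) + fp8_bytes_per_block(n, k)
print(f"BF16: {bf16_total} bytes, FP8 + scales: {fp8_total} bytes")
print(f"Ratio: {fp8_total / bf16_total:.4f}")
```

With these shapes the scales add about 2 KB, so the FP8 inputs end up at roughly 50.5% of the BF16 size rather than exactly half.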

Part 3: Converting to FP8 with Scaling

def cast_to_fp8_per_token(x: torch.Tensor):
    """Convert tensor to FP8 with one scale per token (row) per 128-element block"""
    assert x.dim() == 2
    m, n = x.shape

    # Pad to a multiple of 128 columns (the scaling block size, not an FP8 requirement)
    pad_size = (128 - (n % 128)) % 128
    if pad_size > 0:
        x = torch.nn.functional.pad(x, (0, pad_size), value=0)

    # Reshape for scaling calculation
    x_view = x.view(m, -1, 128)  # [m, n/128, 128]

    # Find max absolute value per 128-element block
    x_amax = x_view.abs().float().amax(dim=2).clamp(1e-4)  # [m, n/128]

    # Scale to FP8 range (448.0 is max representable value)
    fp8_data = (x_view * (448.0 / x_amax.unsqueeze(2))).to(torch.float8_e4m3fn)
    scale_factors = (x_amax / 448.0)

    return fp8_data.view(m, -1)[:, :n], scale_factors

# Convert our matrices
lhs_fp8, lhs_scales = cast_to_fp8_per_token(lhs)
print(f"Original: {lhs.dtype}, Converted: {lhs_fp8.dtype}")
print(f"Scale factors shape: {lhs_scales.shape}")

Critical concept: Scaling prevents overflow and maintains precision in the limited FP8 range.
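
To see why scaling matters, it helps to emulate the FP8 grid on the CPU. The sketch below is a toy model of E4M3 rounding (3 mantissa bits, smallest normal exponent -6, largest value 448) that saturates out-of-range inputs. quantize_e4m3 and quantize_with_scale are hypothetical helpers, and the rounding behavior of a real torch.float8_e4m3fn cast may differ in edge cases.

```python
import math

E4M3_MAX = 448.0      # largest finite float8_e4m3fn value
MIN_EXPONENT = -6     # exponent of the smallest normal value (2**-6)
MANTISSA_BITS = 3

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value, saturating at +/-448 (toy model)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # floor(log2(a)), clamped so that tiny values fall on the subnormal grid
    exponent = max(math.frexp(a)[1] - 1, MIN_EXPONENT)
    step = 2.0 ** (exponent - MANTISSA_BITS)
    return sign * min(round(a / step) * step, E4M3_MAX)

def quantize_with_scale(values):
    """Scale the block so its amax maps to 448, quantize, then dequantize."""
    amax = max(max(abs(v) for v in values), 1e-4)
    scale = amax / E4M3_MAX
    return [quantize_e4m3(v / scale) * scale for v in values]

def max_rel_error(approx, exact):
    return max(abs(a - e) / abs(e) for a, e in zip(approx, exact))

block = [0.0010, 0.0023, 0.0031, 0.0042]
direct = [quantize_e4m3(v) for v in block]
scaled = quantize_with_scale(block)
print(f"direct cast: max relative error {max_rel_error(direct, block):.1%}")
print(f"scaled cast: max relative error {max_rel_error(scaled, block):.1%}")
```

Small activations cast directly land on the coarse subnormal grid and lose most of their information, while the scaled version keeps the relative error within what 3 mantissa bits allow.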

Part 4: Block-wise Scaling for RHS

def cast_to_fp8_per_block(x: torch.Tensor):
    """Convert tensor to FP8 with per-block scaling (128x128 blocks)"""
    m, n = x.shape

    # Pad to 128x128 blocks
    padded_m = ((m + 127) // 128) * 128
    padded_n = ((n + 127) // 128) * 128

    x_padded = torch.zeros((padded_m, padded_n), dtype=x.dtype, device=x.device)
    x_padded[:m, :n] = x

    # Reshape into 128x128 blocks
    x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128)

    # Find max per block
    x_amax = x_view.abs().float().amax(dim=(1, 3), keepdim=True).clamp(1e-4)

    # Scale to FP8
    x_scaled = (x_view * (448.0 / x_amax)).to(torch.float8_e4m3fn)
    scale_factors = (x_amax / 448.0).view(x_view.size(0), x_view.size(2))

    return x_scaled.view_as(x_padded)[:m, :n], scale_factors

# Convert RHS with block scaling
rhs_fp8, rhs_scales = cast_to_fp8_per_block(rhs)
print(f"RHS FP8 shape: {rhs_fp8.shape}")
print(f"RHS scales shape: {rhs_scales.shape}")

Why different scaling: the LHS gets one scale per row per 128 elements (1x128, fine-grained), while the RHS gets one scale per 128x128 block, which is coarser but cheaper.

Part 5: Preparing Tensors for DeepGEMM

# DeepGEMM requires specific tensor layouts
from deep_gemm import get_col_major_tma_aligned_tensor

# LHS scales must be transposed and TMA-aligned
lhs_scales_aligned = get_col_major_tma_aligned_tensor(lhs_scales)

# RHS scales must be contiguous
assert rhs_scales.is_contiguous()

# Package the inputs
lhs_input = (lhs_fp8, lhs_scales_aligned)
rhs_input = (rhs_fp8, rhs_scales)

print("✓ Tensors prepared for DeepGEMM")
print(f"LHS scales alignment: {lhs_scales_aligned.stride()}")

TMA requirement: Tensor Memory Accelerator needs specific memory alignment for optimal performance.

Part 6: Your First DeepGEMM Call

# Perform the FP8 GEMM
deep_gemm.gemm_fp8_fp8_bf16_nt(lhs_input, rhs_input, output)

# Verify correctness
reference = lhs @ rhs.t()
error = torch.abs(output - reference).max().item()
relative_error = (error / torch.abs(reference).max().item()) * 100

print(f"Max absolute error: {error:.6f}")
print(f"Relative error: {relative_error:.3f}%")
print("✓ FP8 GEMM completed successfully!")

Result: High-performance FP8 matrix multiplication with automatic kernel optimization.

Part 7: Understanding the Performance Gain

import time

def benchmark_gemm(func, *args, num_runs=10):
    # Warmup
    for _ in range(3):
        func(*args)
    torch.cuda.synchronize()

    # Timing (perf_counter is a monotonic, high-resolution clock)
    start = time.perf_counter()
    for _ in range(num_runs):
        func(*args)
    torch.cuda.synchronize()

    return (time.perf_counter() - start) / num_runs

# Benchmark both versions
fp8_time = benchmark_gemm(deep_gemm.gemm_fp8_fp8_bf16_nt, lhs_input, rhs_input, output)
bf16_time = benchmark_gemm(lambda x, y, out: out.copy_(x @ y.t()), lhs, rhs, reference)

# Calculate throughput (TFLOPS)
ops = 2 * m * n * k  # 2 FLOPs (one multiply, one add) per multiply-accumulate
fp8_tflops = ops / fp8_time / 1e12
bf16_tflops = ops / bf16_time / 1e12

print(f"FP8 GEMM:  {fp8_time*1000:.2f}ms ({fp8_tflops:.1f} TFLOPS)")
print(f"BF16 GEMM: {bf16_time*1000:.2f}ms ({bf16_tflops:.1f} TFLOPS)")
print(f"Speedup: {bf16_time/fp8_time:.1f}x")

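Performance: on Hopper, FP8 Tensor Cores peak at 2x the BF16 rate, so large GEMMs can approach a 2x speedup while using half the memory. A GEMM as small as this one is dominated by launch overhead, so expect a much smaller gap here.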
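
A back-of-the-envelope calculation shows how problem size affects what the benchmark measures. The sketch below computes the ideal compute time of a GEMM at a given peak rate; ASSUMED_PEAK_TFLOPS is an illustrative round number, not a measured or vendor figure, and the helper names are hypothetical.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating point operations in an m x n x k GEMM (multiply + add)."""
    return 2 * m * n * k

def ideal_time_us(m: int, n: int, k: int, peak_tflops: float) -> float:
    """Time in microseconds if the GPU ran at peak the whole time."""
    return gemm_flops(m, n, k) / (peak_tflops * 1e12) * 1e6

ASSUMED_PEAK_TFLOPS = 1000.0  # assumption for illustration only

for shape in [(128, 256, 512), (4096, 2112, 7168)]:
    t = ideal_time_us(*shape, ASSUMED_PEAK_TFLOPS)
    print(f"{shape}: {gemm_flops(*shape):,} FLOPs, ideal time {t:.3f} us")
```

Under this assumption the 128x256x512 GEMM needs only a few hundredths of a microsecond of compute, far less than the microseconds a kernel launch typically costs, so a timing of it mostly reflects overhead. The larger shape needs over 100 microseconds and gives a more meaningful throughput number.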

Part 8: Grouped GEMM - Processing Multiple Experts

# Simulate MoE (Mixture of Experts) scenario
num_experts = 4
tokens_per_expert = [128, 96, 112, 144]  # Variable tokens per expert
expert_dim = 512

# Create contiguous tensor for all tokens
total_tokens = sum(tokens_per_expert)
alignment = deep_gemm.get_m_alignment_for_contiguous_layout()  # 128

# Align each expert's token count
aligned_tokens = [((t + alignment - 1) // alignment) * alignment for t in tokens_per_expert]
total_aligned = sum(aligned_tokens)

print(f"Original tokens: {tokens_per_expert}")
print(f"Aligned tokens: {aligned_tokens}")
print(f"Total aligned: {total_aligned}")

MoE insight: Different experts process different numbers of tokens - grouping improves efficiency.

Part 9: Setting Up Grouped GEMM Data

# Create inputs for grouped GEMM
lhs_grouped = torch.randn((total_aligned, k), device='cuda', dtype=torch.bfloat16)
rhs_grouped = torch.randn((num_experts, n, k), device='cuda', dtype=torch.bfloat16)
output_grouped = torch.empty((total_aligned, n), device='cuda', dtype=torch.bfloat16)

# Create mapping tensor
m_indices = torch.empty(total_aligned, device='cuda', dtype=torch.int32)
start = 0
# Use a distinct loop name so the aligned_tokens list is not overwritten (it is reused later)
for expert_id, (orig_tokens, aligned) in enumerate(zip(tokens_per_expert, aligned_tokens)):
    # Real tokens get expert ID
    m_indices[start:start + orig_tokens] = expert_id
    # Padding tokens get -1 (ignored)
    m_indices[start + orig_tokens:start + aligned] = -1
    start += aligned

print(f"Mapping tensor shape: {m_indices.shape}")
print(f"Expert assignments: {m_indices[:20]}")  # First 20 tokens

Mapping: Each token knows which expert should process it.
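
The layout of the mapping can be checked on the CPU with plain lists. This is a minimal sketch: build_m_indices is a hypothetical helper, and the alignment of 128 is taken from the comment in Part 8 rather than queried from the library.

```python
def build_m_indices(tokens_per_expert, alignment=128):
    """Return (m_indices, aligned_counts) for a contiguous grouped layout."""
    m_indices, aligned_counts = [], []
    for expert_id, count in enumerate(tokens_per_expert):
        aligned = -(-count // alignment) * alignment  # round up to a multiple of alignment
        m_indices += [expert_id] * count              # real tokens carry the expert id
        m_indices += [-1] * (aligned - count)         # padding rows are marked -1
        aligned_counts.append(aligned)
    return m_indices, aligned_counts

m_indices, aligned_counts = build_m_indices([128, 96, 112, 144])
print(f"Aligned counts: {aligned_counts}")
print(f"Total rows: {len(m_indices)}, padding rows: {m_indices.count(-1)}")
print(f"Rows 222-225: {m_indices[222:226]}")  # boundary between expert 1 and its padding
```

For the token counts used above, the four experts occupy 128, 128, 128, and 256 rows, so 160 of the 640 rows are padding.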

Part 10: Converting Grouped Data to FP8

# Convert LHS (same as before)
lhs_grouped_fp8, lhs_grouped_scales = cast_to_fp8_per_token(lhs_grouped)
lhs_grouped_scales = get_col_major_tma_aligned_tensor(lhs_grouped_scales)

# Convert each expert's RHS separately
rhs_grouped_fp8 = torch.empty_like(rhs_grouped, dtype=torch.float8_e4m3fn)
rhs_grouped_scales = torch.empty((num_experts, (n + 127) // 128, (k + 127) // 128), 
                                device='cuda', dtype=torch.float32)

for expert_id in range(num_experts):
    rhs_grouped_fp8[expert_id], rhs_grouped_scales[expert_id] = cast_to_fp8_per_block(rhs_grouped[expert_id])

# Package inputs
lhs_grouped_input = (lhs_grouped_fp8, lhs_grouped_scales)
rhs_grouped_input = (rhs_grouped_fp8, rhs_grouped_scales)

print("✓ Grouped data converted to FP8")

Expert-wise: Each expert has its own scaling factors for optimal precision.

Part 11: Running Grouped GEMM

# Perform grouped GEMM
deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(
    lhs_grouped_input, 
    rhs_grouped_input, 
    output_grouped, 
    m_indices
)

# Verify by computing reference
reference_grouped = torch.zeros_like(output_grouped)
start = 0
for expert_id, aligned in enumerate(aligned_tokens):  # distinct name avoids shadowing the list
    end = start + aligned
    reference_grouped[start:end] = lhs_grouped[start:end] @ rhs_grouped[expert_id].t()
    start = end

# Mask out padding tokens for comparison
valid_mask = (m_indices != -1).unsqueeze(1)
output_masked = torch.where(valid_mask, output_grouped, torch.zeros_like(output_grouped))
reference_masked = torch.where(valid_mask, reference_grouped, torch.zeros_like(reference_grouped))

error = torch.abs(output_masked - reference_masked).max().item()
print(f"Grouped GEMM error: {error:.6f}")
print("✓ Grouped GEMM completed successfully!")

Validation: Compare against standard computation to ensure correctness.

Part 12: Weight Gradient GEMM

# For training: compute weight gradients
def setup_weight_gradient():
    m_grad, k_grad, n_grad = 256, 1024, 512

    # Activations (forward pass)
    activations = torch.randn((m_grad, k_grad), device='cuda', dtype=torch.bfloat16)

    # Gradient w.r.t. output (from backprop)
    grad_output = torch.randn((m_grad, n_grad), device='cuda', dtype=torch.bfloat16)

    # Weight gradient accumulator (typically has residual)
    weight_grad = torch.randn((n_grad, k_grad), device='cuda', dtype=torch.float) * 0.1

    return activations, grad_output, weight_grad

activations, grad_output, weight_grad = setup_weight_gradient()

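# Convert to FP8. With the NT convention (out += lhs @ rhs^T) the reduction
# runs over the token dimension, so transpose first:
# grad_output^T is [n, m] and activations^T is [k, m]
act_fp8, act_scales = cast_to_fp8_per_token(activations.t().contiguous())
grad_fp8, grad_scales = cast_to_fp8_per_token(grad_output.t().contiguous())

# Prepare inputs (both need transposed scales)
act_input = (act_fp8, get_col_major_tma_aligned_tensor(act_scales))
grad_input = (grad_fp8, get_col_major_tma_aligned_tensor(grad_scales))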

print(f"Weight gradient shape: {weight_grad.shape}")
print(f"Accumulator dtype: {weight_grad.dtype}")  # FP32 for precision

Training context: Weight gradients accumulate many small updates - need FP32 precision.
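
The effect is easy to reproduce on the CPU. The sketch below uses NumPy's `float16` as a stand-in for a low-precision accumulator, since NumPy has no BF16 type; BF16 keeps even fewer mantissa bits, so the effect there is stronger.

```python
import numpy as np

step = 1e-4
acc16 = np.float16(1.0)
acc32 = np.float32(1.0)
for _ in range(10_000):
    # Each update is smaller than half the float16 spacing near 1.0
    acc16 = np.float16(acc16 + np.float16(step))
    acc32 = np.float32(acc32 + np.float32(step))

print(acc16)  # 1.0: every update was rounded away
print(acc32)  # close to 2.0
```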

Part 13: Computing Weight Gradients

# Compute weight gradients with accumulation
original_grad = weight_grad.clone()

deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(grad_input, act_input, weight_grad)

# Verify: grad_output^T @ activations + original_grad
reference_update = grad_output.float().t() @ activations.float()
expected_grad = original_grad + reference_update

error = torch.abs(weight_grad - expected_grad).max().item()
relative_error = error / torch.abs(expected_grad).max().item()

print(f"Weight gradient error: {error:.6f}")
print(f"Relative error: {relative_error*100:.3f}%")
print("✓ Weight gradient computation successful!")

Accumulation: New gradients are added to existing values, enabling mini-batch training.
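
The shapes involved are easier to see in a small NumPy sketch. It assumes the `nt` suffix means `out += lhs @ rhs.T` with `lhs` of shape [M, K] and `rhs` of shape [N, K]; `gemm_nt_accumulate` is a hypothetical stand-in for illustration, not DeepGEMM's API.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 8, 6, 4  # tokens, in_features, out_features
activations = rng.standard_normal((m, k)).astype(np.float32)
grad_output = rng.standard_normal((m, n)).astype(np.float32)
weight_grad = np.zeros((n, k), dtype=np.float32)

def gemm_nt_accumulate(lhs, rhs, out):
    """out += lhs @ rhs.T, with lhs [M, K] and rhs [N, K]."""
    out += lhs @ rhs.T

# The reduction runs over tokens, so both operands are transposed first
for _ in range(2):  # two "micro-batches" accumulate into the same buffer
    gemm_nt_accumulate(grad_output.T, activations.T, weight_grad)

print(weight_grad.shape)  # (4, 6)
```

After two calls the buffer holds twice the single-batch gradient, which is the accumulation behaviour described above.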

Part 14: Performance Monitoring

def analyze_gemm_performance(m, n, k, operation="forward"):
    # Theoretical peak performance
    # H100 SXM peaks at ~1979 TFLOPS dense FP8 (other SKUs differ)
    ops = 2 * m * n * k
    peak_time = ops / 1979e12  # Theoretical minimum time

    # Memory bandwidth
    fp8_bytes = (m * k + n * k) * 1  # FP8 inputs
    bf16_bytes = m * n * 2           # BF16 output  
    # FP32 scales: one per 1x128 LHS slice, one per 128x128 RHS block
    scale_bytes = (m * ((k + 127) // 128) + ((n + 127) // 128) * ((k + 127) // 128)) * 4
    total_bytes = fp8_bytes + bf16_bytes + scale_bytes

    # H100 SXM has ~3.35 TB/s memory bandwidth
    bandwidth_time = total_bytes / 3.35e12

    print(f"\n{operation.upper()} GEMM Analysis (M={m}, N={n}, K={k}):")
    print(f"Operations: {ops/1e9:.1f} GigaOps")
    print(f"Compute bound time: {peak_time*1000:.2f}ms")
    print(f"Memory bound time: {bandwidth_time*1000:.2f}ms")
    print(f"Bottleneck: {'Compute' if peak_time > bandwidth_time else 'Memory'}")

# Analyze our configurations
analyze_gemm_performance(128, 256, 512, "forward")
analyze_gemm_performance(256, 512, 1024, "weight_grad")

Performance tuning: Understanding compute vs memory bottlenecks helps optimize configurations.
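
Another way to express the same comparison is arithmetic intensity (FLOPs per byte moved) against the hardware's ridge point (peak FLOPs divided by bandwidth). The sketch below is a rough roofline estimate; the hardware numbers are assumptions to be replaced with those of your GPU, and scale-factor traffic is ignored.

```python
def arithmetic_intensity(m, n, k):
    """FLOPs per byte for an FP8 x FP8 -> BF16 GEMM (scale traffic ignored)."""
    flops = 2 * m * n * k
    bytes_moved = m * k + n * k + 2 * m * n
    return flops / bytes_moved

def bottleneck(m, n, k, peak_flops, bandwidth):
    # Below the ridge point memory is the limit, above it compute is
    ridge = peak_flops / bandwidth
    return "compute" if arithmetic_intensity(m, n, k) > ridge else "memory"

# Assumed hardware numbers; substitute those of your GPU
PEAK, BW = 1979e12, 3.35e12
print(arithmetic_intensity(128, 256, 512))     # 128.0
print(bottleneck(128, 256, 512, PEAK, BW))     # memory
print(bottleneck(4096, 4096, 4096, PEAK, BW))  # compute
```

Small GEMMs like the ones in this tutorial sit far below the ridge point, so they are limited by memory traffic rather than Tensor Core throughput.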

Part 15: Advanced Configuration

# Control SM utilization for better efficiency
original_sms = deep_gemm.get_num_sms()
print(f"Default SMs: {original_sms}")

# Use fewer SMs for smaller problems to save power
deep_gemm.set_num_sms(original_sms // 2)
print(f"Reduced SMs: {deep_gemm.get_num_sms()}")

# Run a smaller GEMM
small_lhs = torch.randn((64, 256), device='cuda', dtype=torch.bfloat16)
small_rhs = torch.randn((128, 256), device='cuda', dtype=torch.bfloat16)
small_out = torch.empty((64, 128), device='cuda', dtype=torch.bfloat16)

small_lhs_fp8, small_lhs_scales = cast_to_fp8_per_token(small_lhs)
small_rhs_fp8, small_rhs_scales = cast_to_fp8_per_block(small_rhs)

deep_gemm.gemm_fp8_fp8_bf16_nt(
    (small_lhs_fp8, get_col_major_tma_aligned_tensor(small_lhs_scales)),
    (small_rhs_fp8, small_rhs_scales),
    small_out
)

# Restore original setting
deep_gemm.set_num_sms(original_sms)
print("✓ SM configuration demonstrated")

Resource management: Control GPU utilization for power efficiency and multi-tenancy.

Part 16: Debugging and Validation

def validate_fp8_conversion(original, fp8_data, scales):
    """Check how much error the FP8 round trip introduces (2D tensors only)"""
    assert fp8_data.dim() == 2, "expected a 2D FP8 tensor"
    m, n = fp8_data.shape
    if scales.shape[0] == m:
        # Per-token scaling: one scale per row per 128 columns
        fp8_view = fp8_data.float().view(m, -1, 128)
        reconstructed = (fp8_view * scales.unsqueeze(2)).view(m, -1)
    else:
        # Per-block scaling: one scale per 128x128 block
        scales_full = scales.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1)
        reconstructed = fp8_data.float() * scales_full[:m, :n]
    reconstructed = reconstructed[:original.shape[0], :original.shape[1]]

    # Compare
    abs_error = torch.abs(original.float() - reconstructed).max().item()
    rel_error = abs_error / torch.abs(original.float()).max().item()

    print(f"FP8 conversion error: {abs_error:.6f} ({rel_error*100:.3f}%)")
    # e4m3 keeps only 3 mantissa bits, so expect errors of a few percent
    return rel_error < 0.05

# Validate our conversions
lhs_valid = validate_fp8_conversion(lhs, lhs_fp8, lhs_scales)
rhs_valid = validate_fp8_conversion(rhs, rhs_fp8[0], rhs_scales[0])

print(f"LHS conversion valid: {lhs_valid}")
print(f"RHS conversion valid: {rhs_valid}")

Quality assurance: Always validate FP8 conversions to ensure acceptable precision loss.

Part 17: Memory Optimization

def estimate_memory_usage(shapes, operation="gemm"):
    """Estimate GPU memory usage for DeepGEMM operations"""
    m, n, k = shapes

    # Input tensors
    lhs_fp8 = m * k * 1          # FP8
    lhs_scales = m * ((k + 127) // 128) * 4  # FP32
    rhs_fp8 = n * k * 1          # FP8  
    rhs_scales = ((n + 127) // 128) * ((k + 127) // 128) * 4  # FP32

    # Output
    if operation == "gemm":
        output = m * n * 2       # BF16
    else:  # weight_grad
        output = m * n * 4       # FP32

    # Temporary workspace (estimated)
    workspace = max(m, n) * 1024 * 4  # Conservative estimate

    total = lhs_fp8 + lhs_scales + rhs_fp8 + rhs_scales + output + workspace

    print(f"Memory usage for {shapes}:")
    print(f"  Inputs: {(lhs_fp8 + lhs_scales + rhs_fp8 + rhs_scales) / 1024**2:.1f} MB")
    print(f"  Output: {output / 1024**2:.1f} MB")
    print(f"  Workspace: {workspace / 1024**2:.1f} MB")
    print(f"  Total: {total / 1024**2:.1f} MB")

    return total

# Estimate for different problem sizes
estimate_memory_usage((1024, 2048, 4096), "gemm")
estimate_memory_usage((2048, 4096, 8192), "weight_grad")

Capacity planning: Understand memory requirements for different model sizes.

Part 18: Integration with Training Loops

class FP8LinearLayer:
    """Example of integrating DeepGEMM into a training loop"""

    def __init__(self, in_features, out_features):
        self.weight = torch.randn((out_features, in_features), 
                                device='cuda', dtype=torch.bfloat16)
        self.weight_grad = torch.zeros_like(self.weight, dtype=torch.float)

    def forward(self, x):
        # Convert inputs to FP8
        x_fp8, x_scales = cast_to_fp8_per_token(x)
        w_fp8, w_scales = cast_to_fp8_per_block(self.weight)

        # Prepare DeepGEMM inputs
        x_input = (x_fp8, get_col_major_tma_aligned_tensor(x_scales))
        w_input = (w_fp8, w_scales)

        # Allocate output
        output = torch.empty((x.shape[0], self.weight.shape[0]), 
                           device='cuda', dtype=torch.bfloat16)

        # Forward pass
        deep_gemm.gemm_fp8_fp8_bf16_nt(x_input, w_input, output)
        return output

    def backward(self, x, grad_output):
        # With the NT convention (out += lhs @ rhs^T) the reduction runs over
        # the token dimension, so transpose both operands before the FP8 cast
        x_fp8, x_scales = cast_to_fp8_per_token(x.t().contiguous())
        grad_fp8, grad_scales = cast_to_fp8_per_token(grad_output.t().contiguous())

        # Prepare inputs
        x_input = (x_fp8, get_col_major_tma_aligned_tensor(x_scales))
        grad_input = (grad_fp8, get_col_major_tma_aligned_tensor(grad_scales))

        # Accumulate weight gradients: weight_grad += grad_output^T @ x
        deep_gemm.wgrad_gemm_fp8_fp8_fp32_nt(grad_input, x_input, self.weight_grad)

# Demo usage
layer = FP8LinearLayer(512, 256)
x = torch.randn((128, 512), device='cuda', dtype=torch.bfloat16)

# Forward pass
y = layer.forward(x)
print(f"Forward output shape: {y.shape}")

# Backward pass
grad_y = torch.randn_like(y)
layer.backward(x, grad_y)
print(f"Weight grad shape: {layer.weight_grad.shape}")
print("✓ Training loop integration demonstrated")

Real-world usage: How to integrate DeepGEMM into actual neural network training.

Key Takeaways

FP8 = 2x Memory Savings: Half the storage of BF16, plus a small overhead for the scaling factors
Scaling is Critical: Per-token and per-block strategies maintain precision
TMA Alignment: Required for optimal hardware utilization
Grouped Operations: Efficient for MoE and variable-size batches
JIT Compilation: Automatic kernel optimization for each shape
Memory Layout Matters: Column-major scales, contiguous tensors
FP32 Accumulation: Use higher precision for gradients
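
The first takeaway can be checked with quick arithmetic. The sketch below assumes one FP32 scale per 128x128 block of the weight; `weight_bytes` is a hypothetical helper for illustration.

```python
def weight_bytes(n, k, block=128):
    """Storage for an [n, k] weight in BF16 vs FP8 with per-block FP32 scales."""
    bf16 = n * k * 2
    num_scales = -(-n // block) * -(-k // block)  # ceil division per dimension
    fp8 = n * k * 1 + num_scales * 4
    return bf16, fp8

bf16, fp8 = weight_bytes(4096, 4096)
print(bf16 / fp8)  # just under 2.0
```

The scale factors cost only a few kilobytes here, which is why the saving stays very close to 2x.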

Practice Challenge

# Create an MoE layer with 8 experts
# Process a batch with variable expert utilization
# Measure memory savings vs standard implementation

num_experts = 8
expert_tokens = [64, 128, 96, 112, 88, 144, 72, 104]  # Realistic distribution
hidden_dim = 2048

# Your implementation here:
# 1. Set up grouped GEMM inputs
# 2. Convert to FP8 
# 3. Run DeepGEMM
# 4. Compare with standard PyTorch
# 5. Measure performance and memory usage

print("Challenge: Implement efficient MoE with DeepGEMM!")

NumPy Essentials: Arrays and vectorization

Vuk Rosić — Sun, 06 Jul 2025 21:33:35 +0000

NumPy Essentials: Arrays and Vectorization

Part 1: Getting Started

import numpy as np

# Create your first array
arr = np.array([1, 2, 3, 4, 5])
print(arr)  # [1 2 3 4 5]

What happened: We converted a Python list into a NumPy array - the foundation of scientific computing.

Part 2: Arrays vs Lists

# Python list
python_list = [1, 2, 3, 4, 5]
print(type(python_list))  # <class 'list'>

# NumPy array
numpy_array = np.array([1, 2, 3, 4, 5])
print(type(numpy_array))  # <class 'numpy.ndarray'>

Key difference: Lists store references to arbitrary Python objects, while arrays store values of a single type in contiguous memory - much faster for math!
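
A short sketch of what "one type per array" means in practice: a list keeps each element's own type, while `np.array` converts everything to a common dtype.

```python
import numpy as np

mixed_list = [1, 2.5, 3]          # a list keeps each object's own type
arr = np.array(mixed_list)        # an array upcasts everything to one dtype
print([type(x).__name__ for x in mixed_list])  # ['int', 'float', 'int']
print(arr.dtype)                  # float64
print(arr.itemsize, arr.nbytes)   # 8 24
```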

Part 3: Array Properties

arr = np.array([1, 2, 3, 4, 5])
print(arr.shape)   # (5,) - 5 elements in 1 dimension
print(arr.size)    # 5 - total number of elements
print(arr.dtype)   # int64 - data type (platform-dependent, can be int32)

Intuition: Shape tells you the dimensions, size tells you total elements.

Part 4: Creating Arrays

# Zeros
zeros = np.zeros(5)          # [0. 0. 0. 0. 0.]

# Ones
ones = np.ones(3)            # [1. 1. 1.]

# Range
range_arr = np.arange(10)    # [0 1 2 3 4 5 6 7 8 9]

# Evenly spaced
linspace = np.linspace(0, 1, 5)  # [0.   0.25 0.5  0.75 1.  ]

Practice: Create arrays filled with specific values or patterns.

Part 5: 2D Arrays (Matrices)

# Create a 2D array
matrix = np.array([[1, 2, 3], 
                   [4, 5, 6]])
print(matrix.shape)  # (2, 3) - 2 rows, 3 columns
print(matrix.size)   # 6 - total elements

Visualization: Think of it as a table with rows and columns.

Part 6: Array Creation Shortcuts

# 2D zeros
zeros_2d = np.zeros((3, 4))    # 3x4 matrix of zeros

# Identity matrix
identity = np.eye(3)           # 3x3 identity matrix

# Random numbers
random_arr = np.random.random(5)  # 5 random numbers [0,1)

Use cases: Initialize matrices for machine learning, create test data.

Part 7: Array Indexing

arr = np.array([10, 20, 30, 40, 50])

# Single element
print(arr[0])    # 10 - first element
print(arr[-1])   # 50 - last element

# Multiple elements
print(arr[1:4])  # [20 30 40] - slice notation

Rule: Same syntax as Python lists, but slices return views of the original data instead of copies.

Part 8: 2D Array Indexing

matrix = np.array([[1, 2, 3], 
                   [4, 5, 6]])

# Single element
print(matrix[0, 1])  # 2 - row 0, column 1

# Entire row
print(matrix[1, :])  # [4 5 6] - row 1, all columns

# Entire column
print(matrix[:, 2])  # [3 6] - all rows, column 2

Syntax: [row, column] - comma separates dimensions.

Part 9: The Magic of Vectorization

# Python way (slow)
python_list = [1, 2, 3, 4, 5]
result = []
for x in python_list:
    result.append(x * 2)

# NumPy way (fast)
numpy_array = np.array([1, 2, 3, 4, 5])
result = numpy_array * 2  # [2 4 6 8 10]

Vectorization: Apply operations to entire arrays at once - no loops needed!

Part 10: Element-wise Operations

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Basic operations
print(a + b)   # [5 7 9]   - addition
print(a - b)   # [-3 -3 -3] - subtraction
print(a * b)   # [4 10 18] - multiplication
print(a / b)   # [0.25 0.4 0.5] - division

Key insight: Operations happen element-by-element automatically.

Part 11: Broadcasting

# Array and scalar
arr = np.array([1, 2, 3, 4])
result = arr + 10  # [11 12 13 14]

# Different shapes
a = np.array([[1, 2, 3]])      # 1x3
b = np.array([[10], [20]])     # 2x1
result = a + b                 # 2x3 result

Broadcasting: NumPy automatically expands arrays to compatible shapes.
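
The rule behind this can be sketched as follows: shapes are compared from the right, and each pair of dimensions must either match or contain a 1. `np.broadcast_shapes` reports the resulting shape without doing any arithmetic.

```python
import numpy as np

a = np.array([[1, 2, 3]])      # shape (1, 3)
b = np.array([[10], [20]])     # shape (2, 1)

print(np.broadcast_shapes(a.shape, b.shape))  # (2, 3)
print(a + b)
# [[11 12 13]
#  [21 22 23]]

# Dimensions 3 and 4 neither match nor contain a 1, so this fails
try:
    np.ones((2, 3)) + np.ones((2, 4))
except ValueError as err:
    print("incompatible:", err)
```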

Part 12: Mathematical Functions

arr = np.array([1, 4, 9, 16])

# Common functions
print(np.sqrt(arr))    # [1. 2. 3. 4.] - square root
print(np.log(arr))     # natural logarithm
print(np.exp(arr))     # exponential
print(np.sin(arr))     # sine

Advantage: All functions work element-wise across entire arrays.

Part 13: Array Statistics

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Basic statistics
print(np.mean(data))   # 5.5 - average
print(np.median(data)) # 5.5 - middle value
print(np.std(data))    # 2.87 - standard deviation
print(np.sum(data))    # 55 - total

Use case: Quick analysis of datasets without writing loops.

Part 14: Array Reshaping

arr = np.array([1, 2, 3, 4, 5, 6])

# Reshape to 2x3
reshaped = arr.reshape(2, 3)
print(reshaped)
# [[1 2 3]
#  [4 5 6]]

# Flatten back to 1D
flat = reshaped.flatten()  # [1 2 3 4 5 6]

Rule: Total elements must stay the same (2×3 = 6 elements).
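
A small sketch of the rule in action: `-1` lets NumPy infer one dimension from the total, and a shape with the wrong number of elements raises a `ValueError`.

```python
import numpy as np

arr = np.arange(1, 7)            # [1 2 3 4 5 6]
print(arr.reshape(3, -1).shape)  # (3, 2): -1 is inferred as 6 / 3

try:
    arr.reshape(4, 2)            # 4 * 2 = 8 != 6
except ValueError as err:
    print("reshape failed:", err)
```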

Part 15: Boolean Indexing

data = np.array([1, 5, 3, 8, 2, 9])

# Create boolean mask
mask = data > 4  # [False True False True False True]

# Filter data
filtered = data[mask]  # [5 8 9]

# One-liner
big_numbers = data[data > 4]  # [5 8 9]

Power: Select elements based on conditions without loops.

Part 16: Array Concatenation

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Concatenate
combined = np.concatenate([a, b])  # [1 2 3 4 5 6]

# Stack vertically
stacked = np.vstack([a, b])
# [[1 2 3]
#  [4 5 6]]

Use case: Combine datasets or results from different computations.

Part 17: Matrix Operations

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise multiplication
element_wise = A * B  # [[5 12] [21 32]]

# Matrix multiplication
matrix_mult = A @ B   # [[19 22] [43 50]]

Difference: * is element-wise, @ is true matrix multiplication.

Part 18: Performance Comparison

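import time

# Large arrays
size = 1000000
a = np.random.random(size)
b = np.random.random(size)

# Time NumPy
start = time.perf_counter()
result = a + b
numpy_time = time.perf_counter() - start

# Time pure Python on the same data
a_list, b_list = a.tolist(), b.tolist()
start = time.perf_counter()
result_list = [x + y for x, y in zip(a_list, b_list)]
python_time = time.perf_counter() - start

print(f"NumPy time: {numpy_time:.4f} seconds")
print(f"Python time: {python_time:.4f} seconds")
# NumPy is often 10-100x faster here; the exact ratio depends on the machine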

Why faster: NumPy uses optimized C code under the hood.

Part 19: Common Patterns

# Generate data
x = np.linspace(0, 10, 100)  # 100 points from 0 to 10
y = np.sin(x)                # Sine wave

# Find peaks
peaks = y[y > 0.9]

# Normalize data
normalized = (y - np.mean(y)) / np.std(y)

Real-world: Data generation, filtering, and preprocessing.

Part 20: Advanced Indexing

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Fancy indexing
rows = [0, 2]
cols = [1, 2]
result = arr[rows, cols]  # [2 9] - elements at (0,1) and (2,2)

# Boolean indexing with conditions
mask = (arr > 3) & (arr < 8)  # Multiple conditions
filtered = arr[mask]  # [4 5 6 7]

Power: Extract complex patterns from data with simple syntax.

Part 21: Array Sorting

data = np.array([3, 1, 4, 1, 5, 9, 2, 6])

# Sort array
sorted_data = np.sort(data)  # [1 1 2 3 4 5 6 9]

# Get sort indices
indices = np.argsort(data, kind="stable")   # [1 3 6 0 2 4 7 5]

# Sort 2D array
matrix = np.array([[3, 1], [4, 2]])
sorted_matrix = np.sort(matrix, axis=1)  # Sort each row

Use case: Order data for analysis or find top/bottom values.

Part 22: Working with NaN

data = np.array([1, 2, np.nan, 4, 5])

# Check for NaN
has_nan = np.isnan(data)  # [False False True False False]

# Remove NaN
clean_data = data[~np.isnan(data)]  # [1. 2. 4. 5.]

# NaN-aware functions
mean_ignore_nan = np.nanmean(data)  # 3.0

Real data: Often contains missing values - NumPy handles them gracefully.

Part 23: Array Memory and Views

arr = np.array([1, 2, 3, 4, 5])

# Slicing creates a view (shares memory)
view = arr[1:4]
view[0] = 999
print(arr)  # [1 999 3 4 5] - original changed!

# Copy creates new array
copy = arr.copy()
copy[0] = 777
print(arr)  # [1 999 3 4 5] - original unchanged

Memory efficiency: Views save memory, copies ensure independence.
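
When it is unclear whether two arrays share memory, `np.shares_memory` gives a direct answer. A minimal sketch:

```python
import numpy as np

arr = np.arange(5)
print(np.shares_memory(arr, arr[1:4]))     # True: basic slicing gives a view
print(np.shares_memory(arr, arr.copy()))   # False
print(np.shares_memory(arr, arr[[1, 2]]))  # False: fancy indexing copies
print(arr[1:4].base is arr)                # True
```

Note that indexing with a list or a boolean mask always copies, so writing to the result does not change the original array.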

Part 24: Practical Example - Data Analysis

# Simulate temperature data
days = 30
temperatures = np.random.normal(25, 5, days)  # Mean 25°C, std 5°C

# Analysis
avg_temp = np.mean(temperatures)
hot_days = np.sum(temperatures > 30)
cold_days = np.sum(temperatures < 20)
temp_range = np.max(temperatures) - np.min(temperatures)

print(f"Average: {avg_temp:.1f}°C")
print(f"Hot days (>30°C): {hot_days}")
print(f"Cold days (<20°C): {cold_days}")
print(f"Temperature range: {temp_range:.1f}°C")

Real application: Weather data analysis with just a few lines.

Part 25: Image Processing Example

# Create a simple "image" (2D array)
image = np.random.randint(0, 256, (100, 100))  # 100x100 grayscale

# Basic operations
bright_image = np.clip(image + 50, 0, 255)  # Brighten, clipped to the valid range
dark_image = image * 0.5           # Darken (result is float)
threshold = image > 128            # Binary threshold

# Image statistics
print(f"Average brightness: {np.mean(image):.1f}")
print(f"Bright pixels: {np.sum(image > 200)}")

Application: Images are just arrays of numbers - perfect for NumPy.

Part 26: Scientific Computing

# Simulate a simple physics problem
t = np.linspace(0, 10, 1000)       # Time from 0 to 10 seconds (t avoids shadowing the time module)
gravity = 9.81                     # m/s²
initial_velocity = 50              # m/s

# Calculate position (physics equation)
position = initial_velocity * t - 0.5 * gravity * t**2

# Find maximum height
max_height = np.max(position)
max_time = t[np.argmax(position)]

print(f"Maximum height: {max_height:.1f}m at {max_time:.1f}s")

Power: Solve complex scientific problems with vectorized operations.

Part 27: Performance Tips

large_array = np.random.random(1_000_000)  # Example data

# Avoid Python loops
# BAD:
result = []
for x in large_array:
    result.append(x**2)

# GOOD:
result = large_array**2

# Use built-in functions
# BAD:
total = 0
for x in large_array:
    total += x

# GOOD:
total = np.sum(large_array)

Golden rule: If you're writing a loop, there's probably a NumPy function for it.

Part 28: Common Mistakes

# Mistake 1: Creating arrays in loops
# BAD:
arr = np.array([])
for i in range(1000):
    arr = np.append(arr, i)  # Slow!

# GOOD:
arr = np.arange(1000)       # Fast!

# Mistake 2: Not using vectorization
# BAD:
result = np.zeros(len(arr))
for i in range(len(arr)):
    result[i] = arr[i] * 2

# GOOD:
result = arr * 2

Efficiency: Pre-allocate arrays and use vectorized operations.

Part 29: Next Steps

# What you can do with NumPy:
# 1. Data analysis (pandas builds on NumPy)
# 2. Machine learning (scikit-learn uses NumPy)
# 3. Image processing (OpenCV, PIL)
# 4. Scientific computing (SciPy)
# 5. Deep learning (TensorFlow, PyTorch)

# Example: least-squares linear regression (no intercept) with one call
X = np.random.random((100, 2))
y = np.random.random(100)
weights = np.linalg.lstsq(X, y, rcond=None)[0]

Foundation: NumPy is the base for the entire Python scientific ecosystem.

Key Takeaways

Arrays > Lists: Faster, more memory efficient for numerical data
Vectorization: Apply operations to entire arrays at once
Broadcasting: Automatically handle different array shapes
Boolean indexing: Filter data with conditions
No loops: NumPy functions are optimized - use them!
Shape matters: Understanding dimensions is crucial
Memory views: Slicing shares memory, copying creates new arrays

Practice Challenge

# Create a 10x10 matrix of random numbers
# Find all numbers greater than 0.5
# Calculate their average
# Replace numbers less than 0.3 with 0

matrix = np.random.random((10, 10))
mask = matrix > 0.5
high_values = matrix[mask]
average = np.mean(high_values)
matrix[matrix < 0.3] = 0

print(f"Found {len(high_values)} values > 0.5")
print(f"Their average: {average:.3f}")

Master these concepts and you'll have a solid foundation for data science, machine learning, and scientific computing!

Unstructured notes

Vuk Rosić — Sun, 06 Jul 2025 21:20:06 +0000

(I need to put this somewhere, probably towards the end or in the research part)

Note on proper research

(paraphrasing Keller Jordan, I will shorten or move this somewhere else):

We need a venue dedicated to AI model speedruns: structured competitions where researchers must train tiny LLMs or similar models under strict constraints.

The goal is to create a fair environment where new methods (like optimizers or architectures) can be tested against fully optimized baselines. Without this, many papers report "state-of-the-art" results simply because the authors didn't optimize existing methods to their limits, not because the new idea is better.

This wastes a lot of time for other researchers and teams, who implement the method only to find out it's not actually better.

It could also motivate companies like OpenAI and Google to open-source algorithms they want the open-source community to optimize.

Attention Mechanism Tutorial: From Simple to Advanced

Vuk Rosić — Sun, 06 Jul 2025 18:38:18 +0000

Part 1: The Core Idea

Attention is like a spotlight - it helps models focus on what's important.

import torch
import torch.nn.functional as F

# Simple example: Which word is most important?
sentence = ["I", "love", "pizza"]
importance = torch.tensor([0.1, 0.3, 0.6])  # pizza is most important

Intuition: Instead of treating all words equally, attention assigns different weights to focus on what matters most.

Part 2: Basic Attention Weights

# Raw attention scores (how much to focus on each word)
scores = torch.tensor([2.0, 1.0, 3.0])  # [I, love, pizza]

# Convert to probabilities (softmax)
weights = F.softmax(scores, dim=0)
print(weights)  # [0.24, 0.09, 0.67] - pizza gets most attention

What happened: Softmax converts raw scores to probabilities that sum to 1.
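
For reference, softmax itself is only a few lines. The NumPy sketch below subtracts the maximum score before exponentiating, a common trick that leaves the result unchanged but avoids overflow for large scores.

```python
import numpy as np

def softmax(scores):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / exps.sum()

weights = softmax(np.array([2.0, 1.0, 3.0]))
print(np.round(weights, 2))                  # [0.24 0.09 0.67]
print(softmax(np.array([1000.0, 1001.0])))   # finite values, no overflow
```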

Part 3: Weighted Combination

# Word representations (simplified vectors)
words = torch.tensor([[1.0, 0.0],  # "I"
                      [0.0, 1.0],  # "love" 
                      [1.0, 1.0]]) # "pizza"

# Apply attention weights
attended = torch.sum(weights.unsqueeze(1) * words, dim=0)
print(attended)  # Mostly "pizza" representation

Intuition: We combine all word vectors, but "pizza" contributes most because it has the highest attention weight.

Part 4: Computing Attention Scores

# How similar are words? (dot product)
query = torch.tensor([1.0, 1.0])  # What we're looking for
key1 = torch.tensor([1.0, 0.0])  # "I"
key2 = torch.tensor([0.0, 1.0])  # "love"

score1 = torch.dot(query, key1)  # 1.0
score2 = torch.dot(query, key2)  # 1.0

Theory: Attention scores measure how well a query matches each key.

Part 5: The Q, K, V Concept

# Three roles for each word:
# Q (Query): "What am I looking for?"
# K (Key): "What do I represent?" 
# V (Value): "What information do I carry?"

query = torch.tensor([1.0, 0.0])    # Looking for subject
keys = torch.tensor([[1.0, 0.0],   # "I" - matches query well
                     [0.0, 1.0]])   # "love" - doesn't match
values = torch.tensor([[2.0, 3.0], # "I" carries this info
                       [1.0, 4.0]]) # "love" carries this info

Intuition: Query asks "what do I need?", Keys answer "what do I offer?", Values provide the actual information.

Part 6: One-Line Attention

# Complete attention in one line
attention_output = torch.sum(F.softmax(torch.mv(keys, query), dim=0).unsqueeze(1) * values, dim=0)

What it does: Computes scores (query·keys), applies softmax, weights the values.
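
The same computation, unrolled into its three steps with the numbers from Part 5. This sketch uses NumPy so that it runs without PyTorch installed.

```python
import numpy as np

query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[2.0, 3.0], [1.0, 4.0]])

scores = keys @ query                            # [1. 0.]
weights = np.exp(scores) / np.exp(scores).sum()  # approx [0.731 0.269]
output = weights @ values                        # approx [1.731 3.269]
print(scores, weights.round(3), output.round(3))
```

The output sits closer to the value of "I" because its key matches the query better.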

Part 7: Self-Attention Intuition

# In self-attention, each word can attend to every other word
sentence = ["The", "cat", "sat"]
# "cat" might attend to "sat" (what did the cat do?)
# "sat" might attend to "cat" (who sat?)

Key insight: Words can look at each other to understand relationships and context.

Part 8: Multi-Head Attention (Simple)

# Multiple "attention heads" look for different things
head1_query = torch.tensor([1.0, 0.0])  # Looking for subjects
head2_query = torch.tensor([0.0, 1.0])  # Looking for actions

# Each head focuses on different aspects

Why multiple heads: Different heads can specialize in different types of relationships (subject-verb, adjective-noun, etc.).

Part 9: Scaling Up

# Real sentences have many words
seq_len = 10  # 10 words in sentence
d_model = 64  # Each word is 64-dimensional vector

# Q, K, V matrices transform word vectors
Q = torch.randn(seq_len, d_model)  # Queries for each word
K = torch.randn(seq_len, d_model)  # Keys for each word
V = torch.randn(seq_len, d_model)  # Values for each word

Scale: Real models use hundreds to thousands of dimensions and sequences of thousands of tokens.

Part 10: Attention Matrix

# Attention scores between all word pairs
attention_scores = torch.mm(Q, K.transpose(0, 1))  # [10, 10] matrix
attention_weights = F.softmax(attention_scores, dim=1)  # Each row sums to 1

# Row i, column j = how much word i attends to word j

Visualization: Each row shows where one word "looks" in the sentence.

Part 11: Why Attention Works

# Traditional RNN: Information flows sequentially
# Word 1 → Word 2 → Word 3 → Word 4

# Attention: All words can interact directly
# Word 1 ↔ Word 2 ↔ Word 3 ↔ Word 4

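Advantage: Any two words are connected in a single step, so long-range information does not have to survive many sequential updates, and all positions can be processed in parallel.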

Part 12: Putting It All Together

# Complete self-attention step by step
def simple_attention(X):
    Q = X  # Queries (simplified)
    K = X  # Keys 
    V = X  # Values

    scores = torch.mm(Q, K.transpose(0, 1))  # Compute similarities
    weights = F.softmax(scores, dim=1)        # Convert to probabilities
    output = torch.mm(weights, V)             # Weighted combination
    return output

# Usage
word_vectors = torch.randn(5, 8)  # 5 words, 8 dimensions each
attended_vectors = simple_attention(word_vectors)

Result: Each word vector is now updated with information from all other words, weighted by attention.
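
The function above leaves out the division by the square root of the dimension that the multi-head code later in this document applies. A minimal NumPy sketch of the scaled variant, again with Q = K = V = X as a simplification:

```python
import numpy as np

def scaled_self_attention(X):
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                  # scale by sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))  # 5 words, 8 dimensions each
out, w = scaled_self_attention(X)
print(out.shape, w.sum(axis=1))
```

Without the scaling, dot products grow with the vector dimension and softmax tends to put almost all weight on a single word.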

Key Takeaways

Attention = Weighted Average: Focus more on important parts
Q·K = Similarity: How well query matches key
Softmax = Probability: Convert scores to weights that sum to 1
Weighted V = Output: Combine values using attention weights
Self-Attention = Words talking to each other: Every word can attend to every other word

This foundation prepares you for transformer models, which combine attention layers with feed-forward layers!

Understanding Attention: From Words to Vectors

1. Word Embeddings - The Foundation

import torch
import torch.nn as nn
import torch.nn.functional as F

# Sample sentence: "The cat sat on the mat"
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
sentence = [1, 2, 3, 4, 1, 5]  # token IDs

# Create embeddings
vocab_size = len(vocab)
embed_dim = 64
embedding = nn.Embedding(vocab_size, embed_dim)

# Convert tokens to vectors
tokens = torch.tensor(sentence)
embeddings = embedding(tokens)
print(f"Shape: {embeddings.shape}")  # [6, 64]
print(f"'cat' vector: {embeddings[1][:8]}...")  # First 8 dimensions

Each word becomes a 64-dimensional vector. Here the vectors are randomly initialized; they only come to capture semantic meaning through training.

2. The Q, K, V Matrices - Core of Attention

# Attention dimensions
d_model = 64
num_heads = 8
d_k = d_model // num_heads  # 8

# Linear transformations to create Q, K, V
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

# Transform embeddings
Q = W_q(embeddings)  # Queries: "What am I looking for?"
K = W_k(embeddings)  # Keys: "What do I represent?"
V = W_v(embeddings)  # Values: "What information do I carry?"

print(f"Q shape: {Q.shape}")  # [6, 64]
print(f"K shape: {K.shape}")  # [6, 64]
print(f"V shape: {V.shape}")  # [6, 64]

Intuition:

Q (Query): "What information does this word need?"
K (Key): "What kind of information does this word offer?"
V (Value): "What actual information does this word contain?"

3. Computing Attention Scores

# Reshape for multi-head attention
batch_size, seq_len = 1, 6
Q = Q.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)  # [1, 8, 6, 8]
K = K.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)  # [1, 8, 6, 8]
V = V.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)  # [1, 8, 6, 8]

# Attention scores: How much should each word pay attention to others?
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
print(f"Attention scores shape: {scores.shape}")  # [1, 8, 6, 6]

# Example: How much does "cat" attend to each word?
cat_attention = scores[0, 0, 1, :]  # First head, "cat" position
words = ["the", "cat", "sat", "on", "the", "mat"]
for i, word in enumerate(words):
    print(f"cat -> {word}: {cat_attention[i]:.3f}")

4. Softmax and Weighted Values

# Convert scores to probabilities
attention_weights = F.softmax(scores, dim=-1)
print(f"Attention weights shape: {attention_weights.shape}")  # [1, 8, 6, 6]

# Apply attention to values
attended_values = torch.matmul(attention_weights, V)  # [1, 8, 6, 8]

# Concatenate heads back into one d_model-sized vector per word
attended_values = attended_values.transpose(1, 2).contiguous().view(
    batch_size, seq_len, d_model)  # [1, 6, 64]

print(f"Final attended values shape: {attended_values.shape}")

# Show attention pattern for "cat"
print("\nAttention pattern for 'cat':")
cat_weights = attention_weights[0, 0, 1, :]  # First head
for i, word in enumerate(words):
    print(f"  {word}: {cat_weights[i]:.3f}")

5. Complete Self-Attention Implementation

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, d_model = x.size()

        # Linear transformations
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attention_weights = F.softmax(scores, dim=-1)
        attended_values = torch.matmul(attention_weights, V)

        # Concatenate heads
        attended_values = attended_values.transpose(1, 2).contiguous().view(
            batch_size, seq_len, d_model)

        # Final projection
        output = self.W_o(attended_values)
        return output, attention_weights

# Usage
attention = MultiHeadAttention(d_model=64, num_heads=8)
output, weights = attention(embeddings.unsqueeze(0))
print(f"Output shape: {output.shape}")  # [1, 6, 64]

6. Visualizing Attention Patterns

# Extract attention weights for visualization
attention_matrix = weights[0, 0].detach().numpy()  # First head
words = ["the", "cat", "sat", "on", "the", "mat"]

print("Attention Matrix (first head):")
print("From -> To:")
for i, from_word in enumerate(words):
    print(f"{from_word:>4}: ", end="")
    for j, to_word in enumerate(words):
        print(f"{attention_matrix[i,j]:.2f} ", end="")
    print()

7. Key Insights

What happens in attention?

Each word creates a query (what it's looking for)
Each word creates a key (what it represents)
We compute similarity between queries and keys
Higher similarity = more attention
We use attention weights to combine values (actual information)

Example: When processing "cat", the model might:

Query: "I need information about animals"
Look at all keys: "the" (determiner), "sat" (action), "mat" (object)
Pay most attention to "sat" because it's the relevant action
Combine information weighted by attention scores
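
The five steps above can be traced by hand on a tiny example. This is a minimal NumPy sketch, assuming made-up 2-dimensional queries, keys, and values for three words; the numbers are illustrative, not learned by any model.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical 2-d vectors for three words (made up for illustration)
words = ["the", "cat", "sat"]
keys = np.array([[1.0, 0.0],    # key for "the"
                 [0.0, 1.0],    # key for "cat"
                 [1.0, 1.0]])   # key for "sat"
values = np.array([[0.1, 0.2],
                   [0.3, 0.4],
                   [0.5, 0.6]])
query_cat = np.array([1.0, 1.0])  # what "cat" is looking for

# Similarity between the query and every key, scaled by sqrt(d_k)
d_k = keys.shape[1]
scores = keys @ query_cat / np.sqrt(d_k)

# Softmax turns the scores into weights that sum to 1
attn = softmax(scores)

# Output for "cat": attention-weighted combination of the values
output = attn @ values

for word, w in zip(words, attn):
    print(f"cat -> {word}: {w:.3f}")
print("Output for 'cat':", output.round(3))
```

In this toy setup the key for "sat" has the largest dot product with the query, so "sat" receives the largest weight.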

8. Practical Example with Real Meaning

# Sentence: "The cat chased the mouse"
sentence = "The cat chased the mouse"
words = sentence.lower().split()

# Simulate what attention might learn
print("Attention patterns the model might learn:")
print("- 'cat' attends to 'chased' (subject-verb relationship)")
print("- 'chased' attends to 'cat' and 'mouse' (verb-subject-object)")
print("- 'mouse' attends to 'chased' (object-verb relationship)")
print("- 'the' attends to following nouns ('cat', 'mouse')")

# This allows the model to understand:
# - Who did what to whom
# - Grammatical relationships
# - Semantic dependencies

Summary

The attention mechanism allows models to:

Focus on relevant parts of the input
Relate different words to each other
Combine information based on relevance
Understand long-range dependencies

The magic is in the learned Q, K, V matrices that transform word embeddings into queries, keys, and values that can interact meaningfully.

Introduction & Motivation

Vuk Rosić — Sun, 06 Jul 2025 18:34:34 +0000

This is the introduction and motivation for my course, Zero To Mastery AI Researcher & Engineer (in development).

I will build this in public so you can help me figure out what this should be like.

This article is under development but new parts and videos will come almost daily so you can start learning & participating now!

In the end we will try to break the world record for the fastest GPT-2 pretraining, which is currently 3 minutes. We could do it simply by implementing DeepSeek's DeepGEMM for 30-50% faster matmuls on H100, but we first want to deeply understand the whole field and be able to build it all from scratch.

My plan is to reach a level where I can reproduce GPT-3.5 as open source, which is why I will focus on depth of understanding.

I will publish videos on my YouTube channel, as well as one big, full YouTube course at the end.

I am looking for your feedback here and on YouTube.

Which title is best:

0 To World-Class AI Researcher & Engineer In 1 Article
Learn AI Research & Engineering In 1 Article
Everything To Become Researcher & Engineer In 1 Article

Ilya Sutskever:

I found that if I read something very slowly, I will eventually understand it.
Coming up with new ideas is modest; it’s more important to understand the results, existing ideas and what’s going on. Main activity is understanding.

Reading new papers & ideas is cool but progress mostly comes from deeply understanding current ones.

Why Open Source

The future of AI & humanity should not be locked behind paywalls, APIs, or corporate vaults.

Big companies can build products, serve customers, and innovate — that’s their role, but the future must be shaped in the open and by everybody.

I believe you — the curious coder, the student with a second-hand laptop, the researcher with no access to H100s — will help build & direct the most powerful technology in human history.

We don’t need billion-dollar clusters — we need brilliant minds across MIT, Tsinghua, random cafes, and Discord channels.

By the people, for the people.

🤝 Help others
📺 Read, watch, learn
🧑‍💻 Fork the code
📈 Beat the benchmarks
🌍 Make AI open

Alignment Through Transparency

Safety in AI comes not from secrecy, but from exposure. Only through open code, open evaluations, and public scrutiny can we ensure that AI systems are robust, fair, and aligned with everybody.

Openness accelerates innovation, invites scrutiny, and ensures that knowledge does not become a weapon of the elite.

Understand. Create. Open.

I recommend that you master each topic before moving on to the next one.
If there is not enough material or something is not explained well, please comment below with your feedback. That is why I'm building this in the open.

Python for Machine Learning: From Simple to Advanced

Vuk Rosić — Sun, 06 Jul 2025 18:24:17 +0000

Part 1: The Core Idea

Machine Learning = Finding patterns in data to make predictions.

# Simple pattern: Height predicts weight
height = 170  # cm
weight = height * 0.5  # Simple rule: weight = height * 0.5
print(f"Predicted weight: {weight} kg")

Intuition: We find mathematical relationships between inputs (height) and outputs (weight).

Part 2: Working with Data

import numpy as np

# Data is just numbers in arrays
heights = np.array([160, 170, 180, 190])  # Input features
weights = np.array([60, 70, 80, 90])      # Target values

print(f"Average height: {heights.mean()}")

What happened: NumPy arrays store our data efficiently and provide math operations.

Part 3: Finding Patterns

# Measure how strongly height and weight move together
correlation = np.corrcoef(heights, weights)[0, 1]  # Pearson correlation coefficient
print(f"Correlation: {correlation:.2f}")  # Close to 1 = strong positive pattern

Intuition: Correlation tells us how strongly two variables are related.
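
Correlation says how strong the linear relationship is, but not what the line is. A minimal sketch of actually fitting the line, assuming the same `heights` and `weights` arrays as in Part 2 and using `np.polyfit` for a degree-1 least-squares fit:

```python
import numpy as np

heights = np.array([160, 170, 180, 190])
weights = np.array([60, 70, 80, 90])

# Degree-1 least-squares fit: weights ≈ slope * heights + intercept
slope, intercept = np.polyfit(heights, weights, 1)
print(f"weight ≈ {slope:.2f} * height + ({intercept:.2f})")

# Use the fitted line on a new height
print(f"Predicted weight at 175 cm: {slope * 175 + intercept:.1f} kg")
```

Because the four toy points lie on one straight line, the fit should recover weight = height - 100 almost exactly.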

Part 4: Making Predictions

# Simple linear prediction
def predict_weight(height):
    return height - 100  # Pattern that fits the Part 2 data (e.g. 170 cm → 70 kg)

new_height = 175
predicted = predict_weight(new_height)
print(f"Height {new_height}cm → Weight {predicted}kg")

Key insight: Once we find the pattern, we can predict new values.

Part 5: Measuring Errors

# How wrong are our predictions?
actual = np.array([65, 75, 85, 95])
predicted = np.array([60, 70, 80, 90])

error = np.mean((actual - predicted) ** 2)  # Mean Squared Error (in kg²)
print(f"Mean squared error: {error}")

Why this matters: We need to know how good our predictions are.
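
To make the arithmetic concrete, here is a small worked sketch on the same numbers. Every prediction is off by 5 kg, so each squared error is 25. The RMSE and MAE variables are additions for comparison, not part of the original snippet.

```python
import numpy as np

actual = np.array([65, 75, 85, 95])
predicted = np.array([60, 70, 80, 90])

errors = actual - predicted        # per-sample error in kg
mse = np.mean(errors ** 2)         # mean squared error, in kg²
rmse = np.sqrt(mse)                # square root brings the unit back to kg
mae = np.mean(np.abs(errors))      # mean absolute error, in kg

print(f"Errors: {errors}")
print(f"MSE: {mse}, RMSE: {rmse}, MAE: {mae}")
```

MSE punishes large errors more heavily than MAE because the errors are squared before averaging. Here all errors are equal, so RMSE and MAE coincide at 5 kg.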

Part 6: Learning from Data

from sklearn.linear_model import LinearRegression

# Let the computer find the pattern
model = LinearRegression()
model.fit(heights.reshape(-1, 1), weights)  # Learn from data

# Make predictions
prediction = model.predict([[175]])
print(f"Learned prediction: {prediction[0]:.1f}kg")

Magic moment: The computer automatically finds the best line through our data.

Part 7: Train/Test Split

from sklearn.model_selection import train_test_split

# Split data: some for learning, some for testing
X_train, X_test, y_train, y_test = train_test_split(heights.reshape(-1,1), weights, test_size=0.5)

model.fit(X_train, y_train)  # Learn from training data
score = model.score(X_test, y_test)  # R² on unseen data (for regressors, score() is R², not accuracy)
print(f"R² on test data: {score:.2f}")

Why split: We test on data the model hasn't seen to avoid cheating.

Part 8: Multiple Features

# Use multiple inputs for better predictions
data = np.array([[170, 25],  # [height, age]
                 [180, 30],
                 [160, 20],
                 [175, 35]])
weights = np.array([70, 80, 60, 75])

model.fit(data, weights)  # Learn from height AND age
prediction = model.predict([[172, 28]])  # Predict using both features

Power of ML: Use many features to make better predictions.
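
Fitting a linear model on several features is a least-squares problem. The sketch below solves it directly with `np.linalg.lstsq` on the same four rows, with an intercept column added by hand. It illustrates the idea and is not a description of scikit-learn's internals.

```python
import numpy as np

data = np.array([[170, 25],   # [height, age]
                 [180, 30],
                 [160, 20],
                 [175, 35]], dtype=float)
weights = np.array([70, 80, 60, 75], dtype=float)

# Append a column of ones so the model can learn an intercept
A = np.column_stack([data, np.ones(len(data))])

# Solve min ||A @ coef - weights||² for coef = [w_height, w_age, bias]
coef, *_ = np.linalg.lstsq(A, weights, rcond=None)
w_height, w_age, bias = coef
print(f"weight ≈ {w_height:.2f}*height + {w_age:.2f}*age + ({bias:.2f})")

# Predict for a new person: height 172 cm, age 28
new_person = np.array([172, 28, 1.0])
prediction = new_person @ coef
print(f"Predicted weight: {prediction:.1f} kg")
```

On this toy data weight equals height - 100 in every row, so the learned age coefficient should come out near zero: an extra feature only helps when it carries information about the target.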

Part 9: Classification vs Regression

# Regression: Predict numbers (weight, price, temperature)
regressor = LinearRegression()

# Classification: Predict categories (spam/not spam, cat/dog)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()

# Same interface, different problems

Two main types: Predicting numbers vs predicting categories.

Part 10: Real Data with Pandas

import pandas as pd

# Build a small example DataFrame (real projects would load data from a file)
df = pd.DataFrame({
    'height': [160, 170, 180, 190, 165],
    'weight': [60, 70, 80, 90, 65],
    'age': [25, 30, 35, 40, 28]
})

print(df.head())  # See first few rows
print(df.describe())  # Get statistics

Pandas power: Handle real-world messy data with ease.

Part 11: Data Preprocessing

# Clean and prepare data
df['bmi'] = df['weight'] / (df['height']/100)**2  # Create new feature
df = df.dropna()  # Remove missing values

# Separate features and target
X = df[['height', 'age']]  # Features
y = df['weight']           # Target

Essential step: Clean data before feeding to algorithms.

Part 12: Different Algorithms

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Try different algorithms
models = {
    'Linear': LinearRegression(),
    'Tree': DecisionTreeRegressor(),
    'Forest': RandomForestRegressor(),
    'SVM': SVR()
}

# Each finds patterns differently

Algorithm zoo: Different algorithms are good for different problems.

Part 13: Model Evaluation

from sklearn.metrics import mean_squared_error, r2_score

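# Evaluate model performance on the Part 7 split
# (refit first, because Part 8 retrained `model` on two features)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)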

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")  # Lower is better
print(f"R²: {r2:.2f}")     # Higher is better (max 1.0)

Metrics matter: Different ways to measure how good your model is.
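
R² is easier to interpret once you compute it by hand. A minimal sketch, assuming made-up predictions `y_hat`: R² compares the model's squared error against a baseline that always predicts the mean.

```python
import numpy as np

y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_hat = np.array([62.0, 69.0, 81.0, 88.0])   # hypothetical model predictions

ss_res = np.sum((y_true - y_hat) ** 2)           # error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
r2 = 1 - ss_res / ss_tot
print(f"R²: {r2:.2f}")

# The mean-only baseline scores exactly 0 by construction
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot
print(f"Baseline R²: {r2_baseline:.2f}")
```

An R² of 1.0 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than that baseline.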

Part 14: Cross-Validation

from sklearn.model_selection import cross_val_score

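# Test model on multiple train/test splits
# With only 5 rows, each of the 5 folds tests on a single sample, where R² is
# undefined, so score with mean absolute error instead
scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
print(f"Average MAE: {scores.mean():.2f} (+/- {scores.std()*2:.2f})")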

Robust testing: Get a more reliable estimate of model performance.
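
To see what k-fold cross-validation does, here is a hand-rolled sketch in NumPy. The helper `k_fold_indices` and the noisy toy data are assumptions made for illustration; in practice you would use scikit-learn's own splitters.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Toy data on a noisy line (made up for illustration)
rng = np.random.default_rng(42)
X = np.linspace(150, 200, 20)
y = X - 100 + rng.normal(0, 2, size=20)

fold_mse = []
for train_idx, test_idx in k_fold_indices(len(X), k=5):
    # Fit a line on the training folds only
    slope, intercept = np.polyfit(X[train_idx], y[train_idx], 1)
    # Score on the held-out fold
    preds = slope * X[test_idx] + intercept
    fold_mse.append(np.mean((y[test_idx] - preds) ** 2))

print(f"MSE per fold: {np.round(fold_mse, 2)}")
print(f"Average MSE: {np.mean(fold_mse):.2f}")
```

Every sample is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split.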

Part 15: Feature Engineering

# Create better features
df['height_squared'] = df['height'] ** 2
df['age_height'] = df['age'] * df['height']  # Interaction feature

# Sometimes simple transformations improve predictions

Domain knowledge: Understanding your data helps create better features.
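
A short sketch of why an engineered feature can help, assuming a made-up target that depends on the square of the input. The helper `fit_mse` is hypothetical and fits a linear model by least squares.

```python
import numpy as np

x = np.arange(1.0, 11.0)    # inputs 1..10
y = 0.5 * x ** 2 + 3        # made-up quadratic target

def fit_mse(features, target):
    """Least-squares fit with an intercept; returns the training MSE."""
    A = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.mean((A @ coef - target) ** 2)

mse_raw = fit_mse(x.reshape(-1, 1), y)                      # feature: x only
mse_engineered = fit_mse(np.column_stack([x, x ** 2]), y)   # features: x and x²

print(f"MSE with x only:   {mse_raw:.3f}")
print(f"MSE with x and x²: {mse_engineered:.3f}")
```

The model stays linear in its inputs. Adding the squared column simply gives it an input in which the relationship is a straight line.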

Part 16: Handling Categorical Data

# Text categories need special handling
df['gender'] = ['M', 'F', 'M', 'F', 'M']

# Convert to numbers
df_encoded = pd.get_dummies(df, columns=['gender'])
print(df_encoded.columns)  # gender_F, gender_M columns

Encoding: Convert text to numbers for ML algorithms.
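
One-hot encoding itself is simple enough to write by hand. This NumPy sketch mirrors what `pd.get_dummies` produces for the gender column, assuming the same five values:

```python
import numpy as np

genders = np.array(['M', 'F', 'M', 'F', 'M'])

# Sorted unique categories and, for each row, the index of its category
categories, codes = np.unique(genders, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print("Columns:", [f"gender_{c}" for c in categories])
print(one_hot)
```

Each row has exactly one 1, in the column of its category, so no artificial ordering between categories is introduced.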

Part 17: Scaling Features

from sklearn.preprocessing import StandardScaler

# Scale features to similar ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Mean=0, Std=1

# Some algorithms work better with scaled data

Why scale: Prevents features with large values from dominating.
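
Standardization is just "subtract the mean, divide by the standard deviation" per column. A minimal NumPy sketch on the height and age columns from Part 10:

```python
import numpy as np

X = np.array([[160, 25],
              [170, 30],
              [180, 35],
              [190, 40],
              [165, 28]], dtype=float)   # columns: height, age

mean = X.mean(axis=0)
std = X.std(axis=0)           # population std (ddof=0), as StandardScaler uses
X_scaled = (X - mean) / std

print("Column means after scaling:", X_scaled.mean(axis=0).round(6))
print("Column stds after scaling: ", X_scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so height no longer outweighs age just because its raw numbers are larger.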

Part 18: Pipeline

from sklearn.pipeline import Pipeline

# Chain preprocessing and modeling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor())
])

pipeline.fit(X_train, y_train)  # Scaling and training in one step

Clean workflow: Combines preprocessing and modeling automatically.

Part 19: Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

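# Find best model settings
param_grid = {'n_estimators': [50, 100, 200]}
# MSE scoring because the 5-row toy data leaves one sample per fold (R² undefined)
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5,
                           scoring='neg_mean_squared_error')

grid_search.fit(X, y)  # X, y from Part 11; the Part 7 split is too small for cv=5
print(f"Best params: {grid_search.best_params_}")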

Optimization: Automatically find the best settings for your model.
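
Grid search is, at its core, a loop: try each setting, score it on held-out data, keep the best. The sketch below tunes a polynomial degree on made-up data. It uses a single hold-out split rather than the 5-fold cross-validation that `GridSearchCV` performs, to keep the example short.

```python
import numpy as np

x = np.linspace(-3, 3, 30)
y = 0.5 * x ** 2 - x + 2 + 0.3 * np.sin(5 * x)   # made-up target with a small wiggle

# Hold out every other point for validation
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

param_grid = {"degree": [0, 1, 2]}   # candidate settings to try
results = {}
for degree in param_grid["degree"]:
    coeffs = np.polyfit(x_train, y_train, degree)        # fit on training data only
    val_pred = np.polyval(coeffs, x_val)
    results[degree] = np.mean((y_val - val_pred) ** 2)   # score on held-out data

best_degree = min(results, key=results.get)
for degree, mse in results.items():
    print(f"degree={degree}: validation MSE={mse:.3f}")
print(f"Best degree: {best_degree}")
```

The setting is chosen on data the model did not train on. Choosing it by training error would always favor the most flexible model.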

Part 20: Putting It All Together

# Complete ML workflow
def ml_workflow(data, target_column, test_size=0.2):
    # 1. Split features and target
    X = data.drop(target_column, axis=1)
    y = data[target_column]

    # 2. Train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

    # 3. Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', RandomForestRegressor())
    ])

    # 4. Train model
    pipeline.fit(X_train, y_train)

    # 5. Evaluate (R² on the held-out rows)
    score = pipeline.score(X_test, y_test)
    return pipeline, score

# Usage: pass numeric columns only. The string 'gender' column would break the
# scaler, and 'bmi' is computed from the target. test_size=0.4 holds out 2 of
# the 5 rows, since R² needs at least two test samples.
model, r2 = ml_workflow(df[['height', 'age', 'weight']], 'weight', test_size=0.4)
print(f"Test R²: {r2:.2f}")

Complete solution: From raw data to trained model in one function.

Key Takeaways

Data = Numbers: Everything must be converted to numbers
Patterns = Models: Algorithms find mathematical relationships
Train/Test = Validation: Always test on unseen data
Features = Input: Good features make good predictions
Metrics = Evaluation: Measure how well your model works
Pipeline = Workflow: Combine steps for clean, reproducible ML

This foundation gives you the tools to solve real machine learning problems!

Zero To Mastery AI Researcher & Engineer (in development)

Vuk Rosić — Sun, 06 Jul 2025 18:12:19 +0000

The goal is for learners to be able to join top-tier labs (OpenAI, Google, MIT) or publish cutting-edge open-source research, code, and models.

Introduction & Motivation

Python for Machine Learning: From Simple to Advanced

Code a Neural Network from Scratch in NumPy

Google Colab
Article

Attention Mechanism Tutorial: From Simple to Advanced

Unstructured notes

Machine Learning & Deep Learning Course Syllabus

Module 1: Foundations

Building blocks of machine learning and computation

1.1 Programming Fundamentals

1.1.1 Python Basics
- Variables and data types
- Lists, dictionaries, functions
- Control flow and loops
- File I/O and string manipulation
1.1.2 NumPy Essentials
- Arrays and vectorization
- Broadcasting and indexing
- Mathematical operations
- Random number generation
1.1.3 Data Manipulation
- Pandas fundamentals
- Data cleaning and preprocessing
- Handling missing values
- Basic statistics and aggregations

1.2 Mathematical Foundations

1.2.1 Linear Algebra
- Vectors and matrices
- Matrix multiplication
- Eigenvalues and eigenvectors
- Norms and distances
1.2.2 Calculus for ML
- Derivatives and gradients
- Chain rule
- Partial derivatives
- Optimization basics
1.2.3 Probability and Statistics
- Probability distributions
- Bayes' theorem
- Expected value and variance
- Sampling and estimation

1.3 Data Types and Representation

1.3.1 Numerical Data
- Integers and floating point
- Precision and overflow
- Binary representation
- Normalization techniques
1.3.2 Text Data
- ASCII and Unicode
- UTF-8 encoding
- String processing
- Regular expressions
1.3.3 Tensor Operations
- Tensor shapes and dimensions
- Views and strides
- Contiguous memory layout
- Broadcasting rules

Module 2: Core Machine Learning

Traditional ML algorithms and concepts

2.1 Supervised Learning Fundamentals

2.1.1 Linear Models
- Linear regression
- Logistic regression
- Regularization (L1, L2)
- Feature engineering
2.1.2 Tree-Based Methods
- Decision trees
- Random forests
- Gradient boosting
- Feature importance
2.1.3 Instance-Based Learning
- k-Nearest Neighbors
- Distance metrics
- Curse of dimensionality
- Locality sensitive hashing

2.2 Unsupervised Learning

2.2.1 Clustering
- K-means clustering
- Hierarchical clustering
- DBSCAN
- Cluster evaluation
2.2.2 Dimensionality Reduction
- Principal Component Analysis (PCA)
- t-SNE
- UMAP
- Manifold learning
2.2.3 Association Rules
- Market basket analysis
- Apriori algorithm
- FP-growth
- Lift and confidence

2.3 Model Evaluation and Selection

2.3.1 Performance Metrics
- Accuracy, precision, recall
- F1-score and ROC curves
- Regression metrics (MSE, MAE, R²)
- Cross-validation
2.3.2 Bias-Variance Tradeoff
- Overfitting and underfitting
- Model complexity
- Learning curves
- Regularization strategies
2.3.3 Hyperparameter Tuning
- Grid search
- Random search
- Bayesian optimization
- Automated ML (AutoML)

Module 3: Language Modeling Fundamentals

Building language models from scratch

3.1 Statistical Language Models

3.1.1 Bigram Language Models
- Text preprocessing and tokenization
- Bigram counting and probabilities
- Smoothing techniques
- Text generation and sampling
3.1.2 N-gram Models
- Extending to trigrams and beyond
- Backoff and interpolation
- Perplexity evaluation
- Limitations of n-grams
3.1.3 Language Model Evaluation
- Perplexity and cross-entropy
- Held-out evaluation
- Human evaluation
- Downstream task evaluation

3.2 Neural Language Models

3.2.1 Feed-Forward Networks
- Multi-layer perceptrons
- Activation functions
- Universal approximation theorem
- Gradient descent basics
3.2.2 Backpropagation
- Micrograd implementation
- Computational graphs
- Automatic differentiation
- Chain rule in practice
3.2.3 Word Embeddings
- Distributed representations
- Word2Vec and GloVe
- Embedding spaces
- Analogies and relationships

Module 4: Deep Learning Foundations

Core neural network concepts and architectures

4.1 Neural Network Fundamentals

4.1.1 Perceptron and Multi-layer Networks
- Single perceptron
- Multi-layer perceptron (MLP)
- Matrix multiplication implementation
- Activation functions (ReLU, GELU, Sigmoid)
4.1.2 Training Neural Networks
- Loss functions
- Gradient descent variants
- Learning rate scheduling
- Batch processing
4.1.3 Regularization Techniques
- Dropout
- Batch normalization
- Weight decay
- Early stopping

4.2 Sequence Models

4.2.1 Recurrent Neural Networks
- Vanilla RNNs
- Backpropagation through time
- Vanishing gradient problem
- Bidirectional RNNs
4.2.2 LSTM and GRU
- Long Short-Term Memory
- Gated Recurrent Units
- Forget gates and memory cells
- Sequence-to-sequence models
4.2.3 Convolutional Networks
- 1D convolutions for text
- Pooling operations
- CNN architectures
- Feature maps and filters

4.3 Attention Mechanisms

4.3.1 Attention Fundamentals
- Attention intuition
- Query, Key, Value matrices
- Attention scores and weights
- Scaled dot-product attention
4.3.2 Self-Attention
- Self-attention mechanism
- Multi-head attention
- Positional encoding
- Attention visualization
4.3.3 Advanced Attention
- Sparse attention patterns
- Local attention
- Cross-attention
- Attention variants

Module 5: Transformers and Modern Architectures

State-of-the-art language models

5.1 Transformer Architecture

5.1.1 Transformer Components
- Encoder-decoder structure
- Multi-head self-attention
- Feed-forward networks
- Residual connections
5.1.2 Layer Normalization
- LayerNorm vs BatchNorm
- Pre-norm vs post-norm
- RMSNorm variants
- Normalization placement
5.1.3 Positional Encoding
- Sinusoidal encoding
- Learned positional embeddings
- Rotary Position Embedding (RoPE)
- Relative position encoding

5.2 GPT Architecture

5.2.1 GPT-1 and GPT-2
- Decoder-only architecture
- Causal self-attention
- Autoregressive generation
- Scaling laws
5.2.2 GPT-3 and Beyond
- Few-shot learning
- In-context learning
- Emergent abilities
- Instruction following
5.2.3 Modern Variants
- Llama architecture
- Grouped Query Attention (GQA)
- Mixture of Experts (MoE)
- Retrieval-augmented generation

5.3 Training Large Models

5.3.1 Optimization Techniques
- AdamW optimizer
- Learning rate scheduling
- Gradient clipping
- Weight initialization
5.3.2 Scaling Considerations
- Parameter scaling
- Compute scaling
- Data scaling
- Optimal compute allocation
5.3.3 Stability and Convergence
- Training dynamics
- Loss spikes and recovery
- Gradient monitoring
- Debugging techniques

Module 6: Tokenization and Data Processing

Text preprocessing and representation

6.1 Tokenization Fundamentals

6.1.1 Basic Tokenization
- Whitespace tokenization
- Punctuation handling
- Case normalization
- Unicode considerations
6.1.2 Subword Tokenization
- Byte Pair Encoding (BPE)
- WordPiece tokenization
- SentencePiece
- Unigram tokenization
6.1.3 MinBPE Implementation
- BPE algorithm from scratch
- Merge operations
- Vocabulary building
- Encoding and decoding

6.2 Advanced Tokenization

6.2.1 Handling Special Cases
- Out-of-vocabulary words
- Numbers and dates
- Code and structured text
- Multilingual tokenization
6.2.2 Tokenizer Training
- Corpus preparation
- Vocabulary size selection
- Special tokens
- Tokenizer evaluation
6.2.3 Tokenization Trade-offs
- Vocabulary size vs. sequence length
- Compression efficiency
- Downstream task performance
- Computational considerations

6.3 Data Pipeline

6.3.1 Data Loading
- Efficient data loading
- Streaming large datasets
- Data sharding
- Memory management
6.3.2 Data Preprocessing
- Text cleaning
- Deduplication
- Quality filtering
- Format standardization
6.3.3 Synthetic Data Generation
- Data augmentation
- Synthetic data creation
- Curriculum learning
- Domain adaptation

Module 7: Optimization and Training

Advanced training techniques and optimization

7.1 Optimization Algorithms

7.1.1 Gradient Descent Variants
- SGD with momentum
- AdaGrad and AdaDelta
- RMSprop
- Adam and AdamW
7.1.2 Advanced Optimizers
- LAMB optimizer
- Lookahead optimizer
- Shampoo optimizer
- Second-order methods
7.1.3 Learning Rate Scheduling
- Step decay
- Cosine annealing
- Warm restarts
- Cyclical learning rates

7.2 Training Strategies

7.2.1 Initialization Techniques
- Xavier/Glorot initialization
- He initialization
- Layer-wise initialization
- Transfer learning initialization
7.2.2 Regularization Methods
- Dropout variants
- DropConnect
- Stochastic depth
- Mixup and CutMix
7.2.3 Curriculum Learning
- Easy-to-hard scheduling
- Data ordering strategies
- Multi-task learning
- Progressive training

7.3 Training Monitoring

7.3.1 Metrics and Logging
- Loss monitoring
- Gradient norms
- Learning rate tracking
- Model checkpointing
7.3.2 Debugging Training
- Gradient flow analysis
- Activation monitoring
- Weight distribution analysis
- Convergence diagnostics
7.3.3 Hyperparameter Tuning
- Grid and random search
- Bayesian optimization
- Population-based training
- Multi-objective optimization

Module 8: Efficient Computing

Hardware acceleration and optimization

8.1 Device Optimization

8.1.1 CPU vs GPU Computing
- CPU architecture
- GPU parallelism
- Memory hierarchies
- Compute vs memory bound
8.1.2 GPU Programming
- CUDA basics
- Tensor cores
- Memory coalescing
- Kernel optimization
8.1.3 Specialized Hardware
- TPUs and tensor processing
- FPGAs for inference
- Neuromorphic computing
- Edge computing devices

8.2 Precision and Quantization

8.2.1 Numerical Precision
- FP32, FP16, BF16
- Mixed precision training
- Automatic mixed precision
- Precision-accuracy trade-offs
8.2.2 Quantization Techniques
- Post-training quantization
- Quantization-aware training
- Dynamic quantization
- Pruning and sparsity
8.2.3 Low-Precision Inference
- INT8 quantization
- Binary neural networks
- FP8 formats
- Hardware-specific optimizations

8.3 Distributed Training

8.3.1 Data Parallelism
- Distributed Data Parallel (DDP)
- Gradient synchronization
- Communication optimization
- Fault tolerance
8.3.2 Model Parallelism
- Pipeline parallelism
- Tensor parallelism
- ZeRO optimizer states
- Gradient checkpointing
8.3.3 Large-Scale Training
- Multi-node training
- Communication backends
- Load balancing
- Scalability analysis

Module 9: Inference and Deployment

Model serving and optimization

9.1 Inference Optimization

9.1.1 KV-Cache
- Key-value caching
- Memory management
- Cache optimization
- Batched inference
9.1.2 Model Compression
- Pruning techniques
- Knowledge distillation
- Model quantization
- Neural architecture search
9.1.3 Serving Optimization
- Batch processing
- Dynamic batching
- Continuous batching
- Speculative decoding

9.2 Production Deployment

9.2.1 Model Serving
- REST API design
- gRPC services
- WebSocket connections
- Load balancing
9.2.2 Containerization
- Docker containers
- Kubernetes orchestration
- Serverless deployment
- Edge deployment
9.2.3 Monitoring and Observability
- Performance metrics
- Model monitoring
- A/B testing
- Failure detection

9.3 Scalability and Reliability

9.3.1 Horizontal Scaling
- Load balancing
- Auto-scaling
- Resource management
- Cost optimization
9.3.2 Fault Tolerance
- Error handling
- Fallback mechanisms
- Circuit breakers
- Graceful degradation
9.3.3 Security and Privacy
- Input validation
- Rate limiting
- Data privacy
- Model security

Module 10: Fine-tuning and Adaptation

Customizing models for specific tasks

10.1 Supervised Fine-tuning

10.1.1 Full Fine-tuning
- Transfer learning
- Layer freezing
- Learning rate scheduling
- Catastrophic forgetting
10.1.2 Parameter-Efficient Fine-tuning
- Low-Rank Adaptation (LoRA)
- Adapters
- Prompt tuning
- Prefix tuning
10.1.3 Task-Specific Adaptation
- Classification fine-tuning
- Generation fine-tuning
- Multi-task learning
- Domain adaptation

10.2 Reinforcement Learning

10.2.1 RL Fundamentals
- Markov Decision Processes
- Policy gradient methods
- Value-based methods
- Actor-critic algorithms
10.2.2 RLHF (Reinforcement Learning from Human Feedback)
- Reward modeling
- Proximal Policy Optimization (PPO)
- Direct Preference Optimization (DPO)
- Constitutional AI
10.2.3 Advanced RL Techniques
- Multi-agent RL
- Hierarchical RL
- Meta-learning
- Offline RL

10.3 Alignment and Safety

10.3.1 AI Alignment
- Value alignment
- Reward hacking
- Goodhart's law
- Interpretability
10.3.2 Safety Measures
- Content filtering
- Bias detection
- Adversarial robustness
- Failure modes
10.3.3 Evaluation and Testing
- Benchmarking
- Human evaluation
- Stress testing
- Red teaming

Module 11: Multimodal Learning

Beyond text: images, audio, and video

11.1 Vision-Language Models

11.1.1 Image Processing
- Convolutional neural networks
- Vision transformers
- Image tokenization
- Patch embeddings
11.1.2 Vision-Language Architectures
- CLIP model
- Cross-modal attention
- Unified encoders
- Multimodal fusion
11.1.3 Applications
- Image captioning
- Visual question answering
- Text-to-image generation
- Multimodal reasoning

11.2 Generative Models

11.2.1 Variational Autoencoders
- VAE fundamentals
- VQVAE and VQGAN
- Discrete representation learning
- Reconstruction quality
11.2.2 Diffusion Models
- Denoising diffusion models
- Stable diffusion
- Diffusion transformers
- Classifier-free guidance
11.2.3 Autoregressive Models
- PixelCNN and PixelRNN
- Autoregressive transformers
- Masked language modeling
- Bidirectional generation

11.3 Audio and Video

11.3.1 Audio Processing
- Speech recognition
- Text-to-speech
- Audio generation
- Music modeling
11.3.2 Video Understanding
- Video transformers
- Temporal modeling
- Action recognition
- Video generation
11.3.3 Multimodal Integration
- Audio-visual learning
- Cross-modal retrieval
- Multimodal reasoning
- Unified architectures

Module 12: Research and Advanced Topics

Cutting-edge developments and research directions

12.1 Emerging Architectures

12.1.1 Alternative Architectures
- State space models
- Retrieval-augmented generation
- Memory networks
- Neural Turing machines
12.1.2 Efficiency Improvements
- Linear attention
- Sparse transformers
- Efficient architectures
- Mobile-friendly models
12.1.3 Scaling Laws
- Compute scaling
- Data scaling
- Parameter scaling
- Emergence and phase transitions

12.2 Advanced Training Techniques

12.2.1 Self-Supervised Learning
- Contrastive learning
- Masked language modeling
- Next sentence prediction
- Pretext tasks
12.2.2 Few-Shot Learning
- Meta-learning
- In-context learning
- Prompt engineering
- Chain-of-thought reasoning
12.2.3 Continual Learning
- Lifelong learning
- Catastrophic forgetting
- Elastic weight consolidation
- Progressive neural networks

12.3 Evaluation and Benchmarking

12.3.1 Evaluation Frameworks
- Standardized benchmarks
- Evaluation metrics
- Human evaluation
- Automated evaluation
12.3.2 Bias and Fairness
- Bias detection
- Fairness metrics
- Debiasing techniques
- Ethical considerations
12.3.3 Interpretability
- Attention visualization
- Gradient-based methods
- Concept activation vectors
- Mechanistic interpretability

Each subtopic at the lowest level (e.g., 1.1.1, 1.1.2) represents a complete lesson that can be developed into a step-by-step tutorial with code examples, intuitive explanations, and practical exercises.

I will probably move a bunch of these into an optional section.