Training an LLM in Swift: Optimizing Matrix Multiplication from Gflop/s to Tflop/s

#webdev #devops #cloud #astro

A transformer forward pass is almost entirely matrix multiplication. The query, key, and value projections, the attention output projection, and the two feed-forward layers are all general matrix multiplies (GEMMs). The backward pass adds more GEMMs per layer. Profile a single training step and the matmul calls dominate wall-clock time — layer norm, softmax, and the optimizer update are rounding error next to them.

So when you set out to train a small language model in Swift on an Apple Silicon Mac — no PyTorch, no CUDA, just your own code — the performance question collapses into one kernel. A naive matmul runs at single-digit Gflop/s. The hardware can do orders of magnitude more. The Cocoa With Love walkthrough that traces this exact climb, from gigaflops to teraflops, is a useful map: the speedup is not one clever trick but a stack of them, each unlocking the next.

Matmul is the whole compute budget

A GEMM multiplies an m×k matrix by a k×n matrix. The work is O(m·n·k) multiply-adds, but the data is only O(m·k + k·n) elements. That ratio — arithmetic intensity — is high, and high arithmetic intensity means the operation should be compute-bound. Each value you load from memory gets reused many times, so a well-written kernel spends its time doing math, not waiting on RAM.

The naive three-loop implementation throws that away. Walk the standard i, j, k order: for each output element C[i][j], you dot a row of A with a column of B. The row of A is contiguous. The column of B is not — consecutive elements sit n floats apart in memory. Every step of the inner loop jumps a cache line, so you fault on nearly every access. A core that can sustain tens of Gflop/s instead crawls at a handful, bottlenecked entirely on memory latency.

The optimization ladder

The climb from Gflop/s to Tflop/s is a sequence of independent wins. Each is small in code and large in effect.

Reorder the loops. Switching from i, j, k to i, k, j makes the innermost loop walk a row of B and a row of C — both unit-stride. You touch the same data, the same total count of times, but now the hardware prefetcher keeps up. Reordering three loops, with no new code, often more than triples throughput.

Block for cache. Tiling splits the matrices into sub-blocks sized so an A tile, a B tile, and a C tile stay resident in L1 or L2 at once. Each loaded value is then reused across the whole tile before it is evicted. This is the step that converts the kernel from memory-bound back to compute-bound — the point of the entire exercise.

Vectorize. Apple Silicon's NEON registers are 128 bits — four fp32 lanes — and the fused multiply-add instruction does a multiply and an add together. Writing the inner block with Swift's simd_float4, or feeding the compiler a clean loop it can auto-vectorize, lets one instruction retire eight floating-point operations. A microkernel keeps a small grid of C accumulators live in vector registers and streams A and B through them.

Thread it. The output matrix partitions cleanly: separate tiles of C share no data dependency, so they run on separate cores with no locks. DispatchQueue.concurrentPerform spreads tiles across the performance and efficiency cores. On an 8-to-12-core M-series chip this scales close to linearly.

Reach for the GPU. Sustained fp32 teraflops are more than a CPU will give you, even fully threaded and vectorized. A Metal compute shader running the same tiled algorithm on the GPU — or the matrix units behind Apple's own frameworks — is what crosses the line.

Every step on this ladder reorders floating-point additions, and floating-point addition is not associative. A blocked, vectorized, multithreaded result will differ from the naive result in the last few bits. Check each optimization against the naive version with a relative tolerance, never exact equality — and benchmark with a warmup pass and steady-state timing, or thermal throttling and cold caches hand you numbers that mean nothing.

Know when to stop hand-writing kernels

Apple ships cblas_sgemm in the Accelerate framework, and it dispatches to the AMX matrix coprocessor — silicon built for exactly this. It already runs close to the chip's peak. MLX, Apple's array framework, wraps the unified-memory training story end to end. For production work, you use those.

The reason to write the kernel yourself anyway is to understand the climb: to see where the FLOPs go, to read a roofline plot and know which wall you are against, to feel why blocking matters. That knowledge transfers. It is the same reasoning you apply when a real training run is slow and the profiler points at a GEMM someone else wrote.

Working through a matmul kernel means many small, mechanical edits — swapping loop bounds, splitting a tile dimension, lifting an accumulator into a register variable — where one transposed index silently corrupts the result. An editor that tracks those edits and surfaces the diff cleanly is worth more here than on most code.

How far the climb goes

The numbers depend on the chip, the matrix sizes, and how far you push each step. The shape of the curve does not. A naive kernel sits in the low single-digit Gflop/s. Loop order and cache blocking together pull it up by more than an order of magnitude. SIMD and the FMA pipeline roughly double it again. Multithreading multiplies by the core count. The GPU is the final step that puts teraflop-class fp32 within reach on a laptop — the same hardware that started at a few Gflop/s.

Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.