<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranav Sateesh</title>
    <description>The latest articles on DEV Community by Pranav Sateesh (@stprnvsh).</description>
    <link>https://dev.to/stprnvsh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3691754%2F81eb39f8-22ba-4058-a1af-217e5aa8e8ae.png</url>
      <title>DEV Community: Pranav Sateesh</title>
      <link>https://dev.to/stprnvsh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stprnvsh"/>
    <language>en</language>
    <item>
      <title>Mosaic: Sharding Attention Across GPUs When Your Sequence Doesn't Fit</title>
      <dc:creator>Pranav Sateesh</dc:creator>
      <pubDate>Mon, 05 Jan 2026 06:53:43 +0000</pubDate>
      <link>https://dev.to/stprnvsh/mosaic-sharding-attention-across-gpus-when-your-sequence-doesnt-fit-4d92</link>
      <guid>https://dev.to/stprnvsh/mosaic-sharding-attention-across-gpus-when-your-sequence-doesnt-fit-4d92</guid>
      <description>&lt;p&gt;&lt;em&gt;How we built a lightweight library to distribute 150,000-token attention across multiple GPUs&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Attention Doesn't Scale
&lt;/h2&gt;

&lt;p&gt;You've probably heard that transformers have a "quadratic attention bottleneck." Here's what that actually means in practice.&lt;/p&gt;

&lt;p&gt;Attention computes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attention(Q, K, V) = softmax(QKᵀ / √d) × V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The killer is &lt;strong&gt;QKᵀ&lt;/strong&gt; — a matrix of shape &lt;code&gt;(sequence_length × sequence_length)&lt;/code&gt;. For a 150,000-token sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memory = 150,000² × 4 bytes = 90 billion bytes = 84 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's just for the attention weights. One layer. One head. An A100 has 80GB total.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can't fit it.&lt;/strong&gt;&lt;/p&gt;
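
&lt;p&gt;A quick back-of-the-envelope helper (not part of Mosaic) if you want to sanity-check the arithmetic for other sequence lengths:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def attn_matrix_gib(seq_len, bytes_per_elem=4):
    """Memory for one dense (seq_len x seq_len) attention-score matrix, in GiB."""
    return seq_len ** 2 * bytes_per_elem / 2 ** 30

print(attn_matrix_gib(150_000))   # ~83.8 GiB -- one layer, one head, fp32
print(attn_matrix_gib(8_192))     # ~0.25 GiB -- why short contexts are fine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;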

&lt;h2&gt;
  
  
  Existing Solutions (and Their Limits)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FlashAttention&lt;/strong&gt; reduces memory from O(n²) to O(n) by computing attention in tiles without materializing the full matrix. But it still requires the entire sequence on one GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ring Attention&lt;/strong&gt; (from ring-flash-attn) shards the sequence across GPUs. Each GPU holds a chunk of Q and passes K, V around in a ring. Beautiful for 1D sequences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The gap:&lt;/strong&gt; What about models with multiple attention patterns? &lt;/p&gt;

&lt;p&gt;Consider a tabular transformer with shape &lt;code&gt;(batch, rows, features, embed)&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attention over &lt;strong&gt;features&lt;/strong&gt; (axis 2): 5 tokens — fits easily&lt;/li&gt;
&lt;li&gt;Attention over &lt;strong&gt;rows&lt;/strong&gt; (axis 1): 150,000 tokens — needs sharding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No library handled this cleanly. You'd write custom code for each axis, manage different process groups, handle the tensor reshaping yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mosaic: Multi-Axis Attention Sharding
&lt;/h2&gt;

&lt;p&gt;Mosaic is a thin coordination layer that routes different attention axes to appropriate backends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mosaic&lt;/span&gt;

&lt;span class="c1"&gt;# Small axis: run locally
&lt;/span&gt;&lt;span class="n"&gt;feature_attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MultiAxisAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;96&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attention_axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# features dimension
&lt;/span&gt;    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# no communication needed
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Large axis: shard across GPUs
&lt;/span&gt;&lt;span class="n"&gt;row_attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MultiAxisAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;96&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attention_axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# rows dimension  
&lt;/span&gt;    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;       &lt;span class="c1"&gt;# ring attention across GPUs
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Mosaic handles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Permuting the attention axis to the sequence position (sketched below)&lt;/li&gt;
&lt;li&gt;Reshaping for QKV projection&lt;/li&gt;
&lt;li&gt;Dispatching to the right backend&lt;/li&gt;
&lt;li&gt;Restoring the original tensor shape&lt;/li&gt;
&lt;/ol&gt;
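
&lt;p&gt;Step 1 is where the multi-axis part lives. Here's a rough sketch of what that permutation helper could look like — the helper name is taken from the &lt;code&gt;forward()&lt;/code&gt; excerpt later in this post, but the details are our guess, not Mosaic's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

def _permute_to_seq(x, attention_axis):
    # Move the attention axis to the second-to-last position (..., seq, embed)
    # and remember how to undo it afterwards.
    perm = [d for d in range(x.dim()) if d != attention_axis]
    perm.insert(x.dim() - 2, attention_axis)
    inv_perm = [perm.index(d) for d in range(x.dim())]
    return x.permute(perm).contiguous(), inv_perm

# (batch, rows, features, embed) with attention_axis=1 becomes
# (batch, features, rows, embed); inv_perm restores the original layout.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;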

&lt;h2&gt;
  
  
  How Ring Attention Works
&lt;/h2&gt;

&lt;p&gt;The key insight: you don't need all of K and V at once. You can compute partial attention scores, accumulate them, and normalize at the end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4 GPUs, sequence split into 4 chunks:

Initial state:
  GPU 0: Q₀, K₀, V₀
  GPU 1: Q₁, K₁, V₁
  GPU 2: Q₂, K₂, V₂
  GPU 3: Q₃, K₃, V₃

Step 1: Each GPU computes attention with its local K, V
  GPU 0: score₀₀ = Q₀ @ K₀ᵀ
  ...

Step 2: Pass K, V to the next GPU in the ring
  GPU 0 receives K₃, V₃ from GPU 3
  GPU 0 sends K₀, V₀ to GPU 1

Step 3: Compute attention with received K, V
  GPU 0: score₀₃ = Q₀ @ K₃ᵀ
  Accumulate with score₀₀

Repeat for all chunks...

Final: Each GPU has complete attention output for its Q chunk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Memory per GPU:&lt;/strong&gt; O(n²/p) where p = number of GPUs&lt;/p&gt;

&lt;p&gt;With 8 GPUs, you've reduced memory 8×. A 150k sequence now needs ~10GB per GPU instead of 84GB.&lt;/p&gt;
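
&lt;p&gt;For intuition, here is a deliberately naive sketch of that accumulation loop using plain &lt;code&gt;torch.distributed&lt;/code&gt; point-to-point ops and an online softmax. It is illustrative only — Mosaic's &lt;code&gt;ring&lt;/code&gt; backend delegates to ring-flash-attn's fused kernels rather than doing this in Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.distributed as dist

def ring_attention_sketch(q, k, v):
    """q, k, v: local chunks of shape (batch, heads, seq_chunk, head_dim),
    same dtype, living on this rank's GPU (NCCL)."""
    world, rank = dist.get_world_size(), dist.get_rank()
    scale = q.shape[-1] ** -0.5

    # Running statistics for a numerically stable online softmax.
    acc = torch.zeros_like(q)                                  # running weighted sum of V
    row_max = torch.full(q.shape[:-1] + (1,), float("-inf"),
                         dtype=q.dtype, device=q.device)
    row_sum = torch.zeros_like(row_max)

    k_buf, v_buf = k.contiguous(), v.contiguous()
    for step in range(world):
        scores = (q @ k_buf.transpose(-2, -1)) * scale         # (b, h, q_chunk, k_chunk)
        blk_max = scores.amax(dim=-1, keepdim=True)
        new_max = torch.maximum(row_max, blk_max)
        corr = torch.exp(row_max - new_max)                    # rescale old accumulator
        probs = torch.exp(scores - new_max)
        acc = acc * corr + probs @ v_buf
        row_sum = row_sum * corr + probs.sum(dim=-1, keepdim=True)
        row_max = new_max

        if step &lt; world - 1:
            # Pass K, V to the next rank in the ring; receive from the previous rank.
            recv_k, recv_v = torch.empty_like(k_buf), torch.empty_like(v_buf)
            ops = [dist.P2POp(dist.isend, k_buf, (rank + 1) % world),
                   dist.P2POp(dist.irecv, recv_k, (rank - 1) % world),
                   dist.P2POp(dist.isend, v_buf, (rank + 1) % world),
                   dist.P2POp(dist.irecv, recv_v, (rank - 1) % world)]
            for req in dist.batch_isend_irecv(ops):
                req.wait()
            k_buf, v_buf = recv_k, recv_v

    return acc / row_sum                                       # normalize at the very end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;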

&lt;h2&gt;
  
  
  Beyond 1D: Mesh2D Attention
&lt;/h2&gt;

&lt;p&gt;For very long sequences, even ring attention isn't enough. Mesh2D shards both Q and K:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4 GPUs in 2×2 mesh:

         K₀      K₁
       ┌──────┬──────┐
    Q₀ │GPU 0 │GPU 1 │
       ├──────┼──────┤
    Q₁ │GPU 2 │GPU 3 │
       └──────┴──────┘

Each GPU computes one tile of QKᵀ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Memory per GPU:&lt;/strong&gt; O(n²/p²)&lt;/p&gt;

&lt;p&gt;With 64 GPUs in an 8×8 mesh, memory drops 64× per GPU.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MultiAxisAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attention_axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mesh2d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mesh_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
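
&lt;p&gt;Under the hood, a 2D mesh is just two families of process groups — one per mesh row and one per mesh column. A hedged sketch of how those groups could be carved out of the global group (ranks laid out row-major; not necessarily how Mosaic builds its groups):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch.distributed as dist

def build_mesh_groups(mesh_rows, mesh_cols):
    """Every rank must call this with identical arguments (new_group is collective)."""
    rank = dist.get_rank()
    row_groups = [dist.new_group([r * mesh_cols + c for c in range(mesh_cols)])
                  for r in range(mesh_rows)]
    col_groups = [dist.new_group([r * mesh_cols + c for r in range(mesh_rows)])
                  for c in range(mesh_cols)]
    my_row, my_col = divmod(rank, mesh_cols)
    # e.g. gather K/V chunks along this rank's row group and combine partial outputs
    # along its column group (the exact roles depend on how Q and K are laid out).
    return row_groups[my_row], col_groups[my_col]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;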



&lt;h2&gt;
  
  
  Composed Strategies
&lt;/h2&gt;

&lt;p&gt;Real clusters have topology. GPUs within a node communicate via fast NVLink (900 GB/s). GPUs across nodes use slower InfiniBand (200 GB/s).&lt;/p&gt;

&lt;p&gt;Mosaic's &lt;code&gt;ComposedAttention&lt;/code&gt; exploits this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 4 nodes × 8 GPUs = 32 total
&lt;/span&gt;&lt;span class="n"&gt;composed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ComposedAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mesh_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;       &lt;span class="c1"&gt;# (nodes, gpus_per_node)
&lt;/span&gt;    &lt;span class="n"&gt;head_parallel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Split heads across nodes (slow link)
&lt;/span&gt;    &lt;span class="n"&gt;seq_parallel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# Ring within nodes (fast link)
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use &lt;code&gt;HierarchicalAttention&lt;/code&gt; for explicit control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HierarchicalAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;intra_node_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;intra_node_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Compute locally within node
&lt;/span&gt;    &lt;span class="n"&gt;inter_node_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;    &lt;span class="c1"&gt;# Ring between node leaders
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Implementation
&lt;/h2&gt;

&lt;p&gt;Mosaic is ~800 lines of Python. Here's the core pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MultiAxisAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Move attention axis to seq position
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inv_perm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_permute_to_seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Flatten batch dims, project QKV
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embed_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;qkv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;qkv_proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qkv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;permute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;unbind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Dispatch to backend
&lt;/span&gt;        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_attn_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# local, ring, or mesh2d
&lt;/span&gt;
        &lt;span class="c1"&gt;# 4. Project output, restore shape
&lt;/span&gt;        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;out_proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(...))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;permute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv_perm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backends wrap existing libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;local&lt;/code&gt;: &lt;code&gt;F.scaled_dot_product_attention&lt;/code&gt; (FlashAttention)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ring&lt;/code&gt;: &lt;code&gt;ring_flash_attn_func&lt;/code&gt; from ring-flash-attn&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mesh2d&lt;/code&gt;: Custom all-gather + SDPA&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All use FlashAttention kernels for the actual attention computation.&lt;/p&gt;
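
&lt;p&gt;Because each backend is just a callable, dispatch can be resolved once at construction time — roughly like this (illustrative names, not Mosaic's exact internals; note that SDPA expects &lt;code&gt;(batch, heads, seq, head_dim)&lt;/code&gt; while the flash-attn family expects &lt;code&gt;(batch, seq, heads, head_dim)&lt;/code&gt;, so a real implementation also normalizes layouts):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch.nn.functional as F

def resolve_backend(name):
    if name == "local":
        # SDPA dispatches to a FlashAttention kernel when shapes/dtypes allow it.
        return F.scaled_dot_product_attention
    if name == "ring":
        from ring_flash_attn import ring_flash_attn_func   # optional dependency
        return ring_flash_attn_func
    raise ValueError(f"unsupported backend: {name}")

# Bound once in __init__, so forward() just calls self._attn_fn(q, k, v):
# self._attn_fn = resolve_backend(backend)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;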

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/stprnvsh/mosaic.git

&lt;span class="c"&gt;# With ring attention support&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;flash-attn ring-flash-attn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;torchrun &lt;span class="nt"&gt;--nproc_per_node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4 train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multi-node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Node 0&lt;/span&gt;
torchrun &lt;span class="nt"&gt;--nnodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nt"&gt;--nproc_per_node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="nt"&gt;--node_rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--master_addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;192.168.1.100 &lt;span class="nt"&gt;--master_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;29500 train.py

&lt;span class="c"&gt;# Node 1  &lt;/span&gt;
torchrun &lt;span class="nt"&gt;--nnodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nt"&gt;--nproc_per_node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="nt"&gt;--node_rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--master_addr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;192.168.1.100 &lt;span class="nt"&gt;--master_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;29500 train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mosaic&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.distributed&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;

&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init_process_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nccl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mosaic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sp_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_world_size&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyModel&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Data is pre-sharded: each GPU has seq_total / world_size tokens
&lt;/span&gt;&lt;span class="n"&gt;x_local&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_my_shard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_local&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Communication handled by Mosaic
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sequence&lt;/th&gt;
&lt;th&gt;GPUs&lt;/th&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Memory/GPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 10k&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;local&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;O(n²)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10k–100k&lt;/td&gt;
&lt;td&gt;2–8&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ring&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;O(n²/p)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100k–1M&lt;/td&gt;
&lt;td&gt;8–64&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mesh2d&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;O(n²/p²)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 1M&lt;/td&gt;
&lt;td&gt;64+&lt;/td&gt;
&lt;td&gt;&lt;code&gt;composed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;O(n²/(p²·h))&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;We optimized for zero overhead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;FlashAttention everywhere&lt;/strong&gt; — All backends use &lt;code&gt;F.scaled_dot_product_attention&lt;/code&gt; for fused GEMM + softmax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-selected dispatch&lt;/strong&gt; — Backend function bound at init, no branching in forward&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;View not copy&lt;/strong&gt; — &lt;code&gt;x.view()&lt;/code&gt; instead of &lt;code&gt;x.reshape()&lt;/code&gt; when contiguous&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-allocated collectives&lt;/strong&gt; — &lt;code&gt;all_gather&lt;/code&gt; into pre-sized tensors, no &lt;code&gt;torch.cat&lt;/code&gt; (sketched after this list)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Module-level imports&lt;/strong&gt; — No import overhead per forward pass&lt;/li&gt;
&lt;/ol&gt;
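
&lt;p&gt;Point 4 in practice — a sketch of the pattern, not Mosaic's code: PyTorch's &lt;code&gt;all_gather_into_tensor&lt;/code&gt; writes every rank's chunk straight into one pre-sized buffer, so there is no per-step list of tensors to &lt;code&gt;torch.cat&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.distributed as dist

def gather_chunks(local_chunk, group=None):
    world = dist.get_world_size(group)
    # One flat, pre-sized output; the collective writes each rank's chunk in place.
    out = torch.empty((world,) + tuple(local_chunk.shape),
                      dtype=local_chunk.dtype, device=local_chunk.device)
    dist.all_gather_into_tensor(out, local_chunk.contiguous(), group=group)
    return out   # (world, *chunk_shape); view/reshape as needed, no extra copy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;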

&lt;h2&gt;
  
  
  What Mosaic Is Not
&lt;/h2&gt;

&lt;p&gt;Mosaic doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-parallelize your model (use nnScaler for that)&lt;/li&gt;
&lt;li&gt;Handle data parallelism (use PyTorch DDP/FSDP)&lt;/li&gt;
&lt;li&gt;Manage model sharding (use FSDP or Megatron)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does one thing: &lt;strong&gt;route multi-axis attention to the right sharding backend&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Origin Story
&lt;/h2&gt;

&lt;p&gt;This came from profiling nanoTabPFN, a transformer for tabular data. The model has attention over both rows (150k) and features (5). Standard ring attention doesn't understand "rows" vs "features" — it just sees a sequence dimension.&lt;/p&gt;

&lt;p&gt;We needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local attention for small axes&lt;/li&gt;
&lt;li&gt;Ring attention for large axes
&lt;/li&gt;
&lt;li&gt;Clean axis routing without rewriting the model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mosaic is the result.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/stprnvsh/mosaic" rel="noopener noreferrer"&gt;github.com/stprnvsh/mosaic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependencies:&lt;/strong&gt; PyTorch 2.0+, NCCL, optionally flash-attn + ring-flash-attn&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>deeplearning</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>Our GPU Was Idle 77% of the Time. Here's How We Fixed It</title>
      <dc:creator>Pranav Sateesh</dc:creator>
      <pubDate>Sat, 03 Jan 2026 19:00:16 +0000</pubDate>
      <link>https://dev.to/stprnvsh/our-gpu-was-idle-77-of-the-time-heres-how-we-fixed-it-56oj</link>
      <guid>https://dev.to/stprnvsh/our-gpu-was-idle-77-of-the-time-heres-how-we-fixed-it-56oj</guid>
      <description>&lt;p&gt;&lt;em&gt;A practical guide to eliminating data transfer bottlenecks in PyTorch — achieving 1.5x speedup with pinned memory, CUDA streams, and GPU Direct Storage.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;We assumed the GPU was our bottleneck. We were wrong.&lt;/p&gt;

&lt;p&gt;While training a transformer model, I noticed something strange in the profiler output: the CPU was spending &lt;strong&gt;77% of its time&lt;/strong&gt; on &lt;code&gt;cudaMemcpyAsync&lt;/code&gt;. Our expensive A100 GPU wasn't compute-bound; it was &lt;em&gt;starving for data&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This post covers how we diagnosed the problem, fixed it with three increasingly aggressive optimizations, and hit the next wall. If you're training models on large datasets and haven't profiled your data pipeline, you might be leaving significant performance on the table.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;We're training &lt;a href="https://github.com/stprnvsh/nanoTabPFN" rel="noopener noreferrer"&gt;nanoTabPFN&lt;/a&gt;, a transformer for tabular data. Training data lives in HDF5 files: 30,000 samples, each with 5,000 rows and 5 features. Hardware: NVIDIA A100-SXM4-80GB.&lt;/p&gt;

&lt;p&gt;The original data loading code was textbook PyTorch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;h5py&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. Correct. And devastatingly slow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Profile First, Optimize Later
&lt;/h2&gt;

&lt;p&gt;Before touching any code, we ran PyTorch's built-in profiler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.profiler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;activities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProfilerActivity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;record_shapes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;profile_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prior&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prof&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;key_averages&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sort_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu_time_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results were shocking:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;CPU Time&lt;/th&gt;
&lt;th&gt;% of Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cudaMemcpyAsync&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;44,084ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;76.78%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cudaMalloc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;7,081ms&lt;/td&gt;
&lt;td&gt;12.33%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cudaLaunchKernel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;645ms&lt;/td&gt;
&lt;td&gt;1.12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aten::bmm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;180ms&lt;/td&gt;
&lt;td&gt;0.31%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The GPU was doing matrix multiplications in &lt;em&gt;milliseconds&lt;/em&gt; while the CPU spent &lt;strong&gt;44 seconds&lt;/strong&gt; copying data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Problem
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;.to(device)&lt;/code&gt; call in PyTorch is synchronous by default. Here's the hidden pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;h5py reads from disk&lt;/strong&gt; → CPU memory (pageable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch allocates&lt;/strong&gt; → CPU staging buffer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cudaMemcpy&lt;/strong&gt; → GPU memory &lt;em&gt;(blocks until complete)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU computes&lt;/strong&gt; → then the CPU loops back to step 1 for the next batch&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The GPU sits idle during steps 1–3. With 5,000-row samples at float32, each batch transfer is ~120MB. That's 12GB of sequential transfers over 100 steps.&lt;/p&gt;
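
&lt;p&gt;You can see the cost of a single blocking copy in isolation with a pair of CUDA events (here a ~120 MB float32 tensor, matching the per-batch transfer size above; assumes a CUDA device is available):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

x_cpu = torch.empty(30_000_000, dtype=torch.float32)   # ~120 MB, pageable memory

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
x_gpu = x_cpu.to("cuda")        # synchronous: the CPU blocks until the copy completes
end.record()
torch.cuda.synchronize()
print(f"H2D copy: {start.elapsed_time(end):.1f} ms")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;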




&lt;h2&gt;
  
  
  Fix #1: Pinned Memory + Non-blocking Transfers
&lt;/h2&gt;

&lt;p&gt;The first optimization: use page-locked (pinned) memory with async transfers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: synchronous, pageable memory
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_np&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After: pinned memory, async transfer
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_np&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;pin_memory&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;non_blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; Pinned memory is DMA-accessible — the GPU can read it directly without CPU intervention. Combined with &lt;code&gt;non_blocking=True&lt;/code&gt;, the transfer happens in the background while the CPU continues working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; &lt;code&gt;cudaMemcpyAsync&lt;/code&gt; time dropped from 44s to ~4s.&lt;/p&gt;
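
&lt;p&gt;One refinement worth considering (our suggestion, not required for the numbers above): calling &lt;code&gt;pin_memory()&lt;/code&gt; every step allocates a fresh pinned buffer each time, so for fixed-size batches it's cheaper to allocate one pinned staging buffer up front and reuse it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import torch

batch_shape = (4, 5000, 5)                     # example shape; use your real batch shape
staging = torch.empty(batch_shape, dtype=torch.float32, pin_memory=True)

def to_gpu(x_np, device="cuda"):
    staging.copy_(torch.from_numpy(x_np))      # pageable host to pinned host (cheap memcpy)
    # Only safe to overwrite `staging` after the previous transfer has completed;
    # with double buffering (Fix #2 below), keep one staging buffer per in-flight batch.
    return staging.to(device, non_blocking=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;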




&lt;h2&gt;
  
  
  Fix #2: CUDA Streams for True Overlap
&lt;/h2&gt;

&lt;p&gt;Non-blocking transfers alone aren't enough. By default, operations on the same CUDA stream are serialized. We need a separate stream for data transfer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PriorDumpDataLoader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transfer_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__iter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Pre-fill buffer with first batches
&lt;/span&gt;        &lt;span class="n"&gt;vram_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_load_to_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefetch&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vram_buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Already in VRAM
&lt;/span&gt;
            &lt;span class="c1"&gt;# Prefetch next batch on separate stream
&lt;/span&gt;            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transfer_stream&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;next_batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_load_to_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;vram_buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Sync before yielding
&lt;/span&gt;            &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_stream&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;wait_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transfer_stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;strong&gt;double buffering&lt;/strong&gt;: while the GPU processes batch N, the CPU+DMA engine load batch N+1. The GPU never waits.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fix #3: GPU Direct Storage (GDS)
&lt;/h2&gt;

&lt;p&gt;The ultimate optimization: bypass the CPU entirely.&lt;/p&gt;

&lt;p&gt;NVIDIA's GPUDirect Storage reads directly from NVMe to GPU memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;kvikio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;

&lt;span class="c1"&gt;# Allocate GPU buffer
&lt;/span&gt;&lt;span class="n"&gt;x_gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Direct read: NVMe → GPU (no CPU copy)
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;kvikio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CuFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_offset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Zero-copy to PyTorch
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; GDS requires raw binary files. HDF5 has headers that need CPU parsing. We added automatic conversion on first run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;convert_h5_to_raw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h5_filename&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;h5py&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h5_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tofile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_X.bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tofile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_y.bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
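
&lt;p&gt;For completeness, the &lt;code&gt;file_offset&lt;/code&gt; passed to &lt;code&gt;pread&lt;/code&gt; would come straight from the raw layout that &lt;code&gt;tofile()&lt;/code&gt; produces: samples are stored back-to-back as float32, so the byte offset of sample &lt;code&gt;i&lt;/code&gt; (and of any batch starting there) is just &lt;code&gt;i × rows × features × 4&lt;/code&gt;. A small helper, assuming that layout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

ITEMSIZE = np.dtype(np.float32).itemsize   # 4 bytes

def sample_offset(i, rows=5000, features=5):
    """Byte offset of sample i in the raw _X.bin dump written by convert_h5_to_raw."""
    return i * rows * features * ITEMSIZE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;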






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Optimized&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total time (100 steps)&lt;/td&gt;
&lt;td&gt;68.75s&lt;/td&gt;
&lt;td&gt;45.30s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.52x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;cudaMemcpyAsync&lt;/code&gt; CPU&lt;/td&gt;
&lt;td&gt;44,084ms&lt;/td&gt;
&lt;td&gt;268ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;164x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps/sec&lt;/td&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td&gt;2.2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.47x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Memory transfer overhead dropped from 77% to &amp;lt;1% of CPU time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The New Bottleneck
&lt;/h2&gt;

&lt;p&gt;With data loading solved, the profile looks completely different:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;CPU Time&lt;/th&gt;
&lt;th&gt;% of Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Command Buffer Full&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;23,450ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;46.91%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cudaLaunchKernel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10,733ms&lt;/td&gt;
&lt;td&gt;21.47%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cudaMalloc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5,607ms&lt;/td&gt;
&lt;td&gt;11.22%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The GPU is now &lt;strong&gt;saturated&lt;/strong&gt;. "Command Buffer Full" means the GPU can't keep up with kernel submissions. This is exactly what we want — the GPU is the bottleneck, not data loading.&lt;/p&gt;

&lt;p&gt;The remaining compute bottleneck is attention (&lt;code&gt;aten::bmm&lt;/code&gt; at 45% CUDA time). With 5,000-row sequences, attention's O(n²) scaling dominates. Flash Attention is the next optimization.&lt;/p&gt;
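
&lt;p&gt;As a preview of that next step (a sketch — the repo's &lt;code&gt;--flash&lt;/code&gt; flag is the real implementation): swapping the explicit &lt;code&gt;bmm&lt;/code&gt; + softmax for PyTorch's fused SDPA avoids materializing the n × n score matrix at all:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

q = torch.randn(4, 8, 5000, 64, device="cuda", dtype=torch.float16)  # (batch, heads, rows, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# Fused attention kernel: no 5000 x 5000 score matrix is ever materialized.
out = F.scaled_dot_product_attention(q, k, v)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;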




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Async is not automatic.&lt;/strong&gt; &lt;code&gt;non_blocking=True&lt;/code&gt; does nothing without pinned memory and proper stream management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pinned memory matters.&lt;/strong&gt; 10x+ difference for large transfers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GDS has constraints.&lt;/strong&gt; True zero-copy requires raw binary files, GDS-compatible NVMe, and proper alignment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know when to stop.&lt;/strong&gt; Once you're GPU-bound, data loading optimizations won't help. Move to model architecture changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pin_memory()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Page-locked CPU memory&lt;/td&gt;
&lt;td&gt;Always for GPU training&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;non_blocking=True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Async H2D transfer&lt;/td&gt;
&lt;td&gt;With CUDA streams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA Streams&lt;/td&gt;
&lt;td&gt;Parallel transfer/compute&lt;/td&gt;
&lt;td&gt;Large batch sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Double buffering&lt;/td&gt;
&lt;td&gt;Prefetch next batch&lt;/td&gt;
&lt;td&gt;I/O-bound workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GDS (kvikio)&lt;/td&gt;
&lt;td&gt;Disk → GPU direct&lt;/td&gt;
&lt;td&gt;Large sequential reads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;All code is available at &lt;a href="https://github.com/stprnvsh/nanoTabPFN" rel="noopener noreferrer"&gt;github.com/stprnvsh/nanoTabPFN&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Baseline&lt;/span&gt;
python train.py &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="nt"&gt;--steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="nt"&gt;--batch-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;6

&lt;span class="c"&gt;# Optimized with GDS&lt;/span&gt;
python train_optimized.py &lt;span class="nt"&gt;--gds-bin&lt;/span&gt; &lt;span class="nt"&gt;--batch-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4 &lt;span class="nt"&gt;--steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200

&lt;span class="c"&gt;# With Flash Attention&lt;/span&gt;
python train_optimized.py &lt;span class="nt"&gt;--flash&lt;/span&gt; &lt;span class="nt"&gt;--gds-bin&lt;/span&gt; &lt;span class="nt"&gt;--batch-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8 &lt;span class="nt"&gt;--steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>deeplearning</category>
      <category>performance</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
