
Papers Mache


Stateless scheduler doubles LLM training speed

Fine-tuning a 10B-parameter model on a single RTX 4090 feels like watching paint dry: most of the GPU sits idle while a handful of layers chew through memory, and the whole job slows to a crawl. The bottleneck isn't the raw FLOPs; it's the rigid coupling between model weights and the devices they're pinned to.

Pipeline parallelism was supposed to solve that, but conventional schedules bind each model stage to a fixed GPU. When a heavyweight head sits on one card, that card becomes the choke point and bubbles waste up to 30 % of the pipeline’s capacity [1]. The cache that powers autoregressive generation suffers a similar fate: each layer hoards its own key‑value memory, ballooning the footprint and throttling batch size.
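
To see where those bubbles come from, consider the textbook synchronous (GPipe-style) schedule, where the idle fraction is roughly (p − 1)/(m + p − 1) for p stages and m microbatches. This quick sketch (standard pipeline arithmetic, not a formula from the paper) shows how fast the waste grows when microbatches are scarce:

```python
# Bubble fraction for a classic synchronous (GPipe-style) pipeline schedule:
# idle time grows with the number of stages p relative to microbatches m.
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

print(bubble_fraction(p=8, m=32))  # ~0.18 -> ~18% of the pipeline idles
print(bubble_fraction(p=8, m=8))   # ~0.47 -> nearly half the capacity wasted
```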

RoundPipe breaks the binding entirely. "RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round-robin manner, achieving a near-zero-bubble pipeline" [1]. On an eight-way RTX 4090 server it delivered 1.48–2.16× the throughput of the strongest existing baselines when fine-tuning models from 1.7B to 32B parameters [1]. The paper also demonstrates LoRA-based fine-tuning of a 235B-parameter model with a 31K-token context on a single server, showing the scheduler scales far beyond the modest setups most hobbyists use.
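
The paper excerpt doesn't include RoundPipe's API, but the scheduling idea is easy to picture. Here's a minimal, hypothetical sketch (all names are illustrative, not the library's) of round-robin dispatch over stateless workers: no stage is pinned to a card, so no card becomes the choke point:

```python
from itertools import cycle

# Illustrative sketch of round-robin stage dispatch (not the RoundPipe API).
# GPUs are treated as interchangeable, stateless workers: any stage of any
# microbatch can run on any device.
NUM_GPUS = 8
STAGES = ["embed", "blocks_0_15", "blocks_16_31", "head"]

def schedule(num_microbatches: int):
    """Assign (microbatch, stage) work items to GPUs in round-robin order."""
    gpus = cycle(range(NUM_GPUS))
    plan = []
    for mb in range(num_microbatches):
        for stage in STAGES:
            plan.append((mb, stage, next(gpus)))
    return plan

for mb, stage, gpu in schedule(num_microbatches=4):
    print(f"microbatch {mb}: run {stage} on cuda:{gpu}")
```

In the real system the stage weights would have to be streamed to whichever device draws the work; the round-robin order just keeps that load spread evenly instead of letting one heavy stage monopolize a card.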

The memory side of the equation gets a similar lift from stochastic KV routing. By training layers to attend either to their own cache or to a predecessor's, the approach lets several depths share a single cache without losing information. At 8K tokens, KV cache memory drops from 1170 MB (baseline) to 293 MB, a 4× reduction, and decode throughput improves from 34.0 tok/s (baseline) to 41.6 tok/s (+22%), "due to skipping K/V projections on non-leader layers" [2]. The authors confirm that the technique "reduces memory consumption, enabling longer contexts and larger batch sizes" [2].
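
Here's one way to picture the mechanism, as a heavily simplified sketch: the group size, function names, and the 50/50 routing probability are all my assumptions, not the paper's. Only one "leader" layer per group writes a cache; followers read it and can skip their own K/V projections at decode time:

```python
import random

# Illustrative sketch of depth-wise KV sharing; group size and the 0.5
# routing probability are assumptions, not values from the paper.
GROUP_SIZE = 4   # one shared cache per 4 layers -> ~4x less KV memory

kv_cache = {}    # leader layer index -> list of (key, value) pairs

def leader_of(layer: int) -> int:
    """The lowest-indexed layer in each group acts as the cache 'leader'."""
    return (layer // GROUP_SIZE) * GROUP_SIZE

def cached_kv(layer: int, own_kv, training: bool = False):
    leader = leader_of(layer)
    if layer == leader:
        # Leaders compute and store K/V for the whole group.
        kv_cache.setdefault(leader, []).append(own_kv)
        return kv_cache[leader]
    if training and random.random() < 0.5:
        # During training, stochastically route to the layer's own K/V so it
        # learns to attend to either source.
        return [own_kv]
    # At decode time, followers reuse the leader's cache and can skip their
    # own K/V projections entirely.
    return kv_cache.get(leader, [own_kv])
```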

The results are impressive, yet they leave open questions. The reported speedups come from an eight‑GPU configuration; it remains unclear how much of the gain survives on a single‑card setup, which is what most independent developers run. The stochastic KV scheme is evaluated on standard benchmarks, but its impact on niche domains or on models that already employ aggressive quantisation has not been explored. Moreover, the round‑robin dispatch assumes roughly homogeneous devices—heterogeneous clusters might re‑introduce imbalance.

If you already struggle to squeeze a 7B model into 24 GB of VRAM, trying RoundPipe's open-source library could let you push the same hardware closer to its theoretical ceiling before you need to shard further. Pairing it with depth-wise KV sharing may free enough memory to double batch size or stretch context windows without sacrificing latency. The safest path is to profile your specific workload: measure tokens per second with and without the scheduler, then repeat after enabling stochastic KV routing. The numbers will tell you whether the stateless pipeline delivers its advertised doubling on your own rig, or whether you need a second GPU to reap the full benefit.
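
A minimal harness is all that comparison needs. The generation functions below are placeholders for your own calls, not library APIs:

```python
import time

# Minimal profiling sketch: `generate` stands in for your own generation
# call and is assumed to return the number of tokens it produced.
def tokens_per_second(generate, prompt: str, max_new_tokens: int = 256) -> float:
    start = time.perf_counter()
    n_tokens = generate(prompt, max_new_tokens)
    return n_tokens / (time.perf_counter() - start)

# Example usage (both generate functions are hypothetical placeholders):
# baseline = tokens_per_second(generate_baseline, prompt)
# routed = tokens_per_second(generate_with_kv_routing, prompt)
# print(f"speedup: {routed / baseline:.2f}x")
```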

References

[1] Efficient Training on Multiple Consumer GPUs with RoundPipe
[2] Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
