Juan Torchia

Posted on • Originally published at juanchi.dev

MegaTrain: Training 100B+ Parameter LLMs on a Single GPU (And Why I Had to Close My Laptop)

I was processing the Scion paper at 11pm — already wrote about it here — when my feed served me a title I had to read twice: "MegaTrain: Full Precision Training of LLMs with 100B+ Parameters on a Single GPU".

First reaction: obvious clickbait. Second reaction: academic clickbait, which is worse because it comes with an abstract and everything. Third reaction, after the first three paragraphs of the actual paper: I closed the laptop and went to get some water.

Not because MegaTrain changes my work tomorrow. It doesn't. But some papers don't teach you a technique — they shift your conceptual ground. This is one of those. What follows is me trying to process out loud what it means for hardware to stop being the excuse.

MegaTrain full precision training single GPU: what the paper actually proposes

The base problem is well understood: training large LLMs requires distributing the model across dozens or hundreds of GPUs because the parameters, gradients, and optimizer states simply don't fit in the VRAM of a single card. A 70B model in full precision (FP32) needs roughly 280GB just for the parameters. An H100 has 80GB. The math doesn't work.
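That "math doesn't work" claim takes one line to verify. Napkin numbers, mine and not the paper's:

```python
H100_VRAM_GB = 80

def fp32_weights_gb(n_params_b: float) -> float:
    """GB needed just to hold the weights in FP32 (4 bytes each; the 1e9s cancel)."""
    return n_params_b * 4

print(fp32_weights_gb(70), fp32_weights_gb(70) / H100_VRAM_GB)  # → 280 3.5
```

Weights alone for a 70B model are 3.5x the VRAM of one H100 — before gradients, optimizer state, or activations even enter the picture.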

The industry's standard answer was: more GPUs, more interconnect, more money. DeepSpeed's ZeRO helped distribute state more efficiently, but the fundamental scaling problem stayed the same — you need the cluster or you don't train.

MegaTrain attacks this from a different angle. The core proposal is what they call a memory-time tradeoff taken to the extreme: instead of having all parameters active in VRAM simultaneously, the system streams parameters from CPU RAM (or NVMe storage) to the GPU at the exact moment they're needed for the forward and backward pass — and discards them afterward.

This isn't a new concept. Gradient checkpointing has existed for years and does something similar with activations. What MegaTrain does differently:

  1. Layer-level granularity: It doesn't work with the full model at once — it goes layer by layer, with intelligent prefetching so the GPU never sits waiting.
  2. Full precision without compromise: Unlike techniques like QLoRA that reduce precision to fit in memory, MegaTrain keeps full FP32 or BF16 on the active parameters.
  3. Optimizer states on CPU: AdamW for 100B parameters needs to store momentum and variance — that's twice the parameters in memory. MegaTrain keeps those on CPU RAM and syncs them per layer.
  4. Aggressive overlap: While the GPU is computing the forward pass for layer N, the system is already pulling the parameters for layer N+1 from CPU.
# Conceptual pseudocode for how MegaTrain handles parameter streaming.
# This is NOT the real paper code — it's my interpretation to understand the flow.
import torch

class MegaTrainLayer:
    def __init__(self, layer_params_on_cpu, optimizer_state_on_cpu):
        # Parameters live in CPU RAM, not VRAM. Pinned (page-locked) memory is
        # what makes non_blocking=True a real async copy instead of a no-op.
        self.params_cpu = layer_params_on_cpu.pin_memory()
        self.optimizer_state = optimizer_state_on_cpu
        self.params_gpu = None  # Only exists while this layer is active

    def prefetch(self):
        """Start moving parameters to the GPU in the background."""
        # Runs on a side CUDA stream while the previous layer is still computing
        self.params_gpu = self.params_cpu.to('cuda', non_blocking=True)

    def forward(self, x):
        """Compute with parameters already resident on the GPU."""
        assert self.params_gpu is not None, "Did you call prefetch() first?"
        return compute(x, self.params_gpu)  # compute() stands in for the layer's actual math

    def evict(self):
        """Free VRAM — we don't need this layer for now."""
        # Backward will need the parameters again, but we'll pull them back then
        self.params_gpu = None
        torch.cuda.empty_cache()

    def optimizer_step(self, gradients):
        """The optimizer step happens on CPU with transferred gradients."""
        # Gradients travel from GPU to CPU
        grads_cpu = gradients.to('cpu')
        # AdamW on CPU — slower per operation, but it uses zero VRAM
        update_params_cpu(self.params_cpu, grads_cpu, self.optimizer_state)  # placeholder
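To see how those pieces fit together, here's my sketch of the loop that would drive such layers. `StubLayer` is a stand-in I invented so the control flow runs anywhere, with no GPU involved — it only demonstrates the prefetch/compute/evict rhythm, not the real transfers:

```python
class StubLayer:
    """Stand-in for a streamed layer; tracks whether it's 'resident' in fake VRAM."""
    def __init__(self, name):
        self.name, self.resident = name, False
    def prefetch(self):
        self.resident = True       # stands in for the async CPU→GPU copy
    def forward(self, x):
        assert self.resident, f"{self.name} was never prefetched"
        return x + 1               # stands in for the layer's real math
    def evict(self):
        self.resident = False      # stands in for freeing VRAM

def streaming_forward(layers, x):
    """Prefetch layer N+1 while 'computing' layer N; at most one layer resident at a time."""
    layers[0].prefetch()
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            layers[i + 1].prefetch()   # in the real system this overlaps on a copy stream
        x = layer.forward(x)
        layer.evict()
    return x

print(streaming_forward([StubLayer(f"L{i}") for i in range(4)], 0))  # → 4
```

In a real PyTorch implementation the prefetch would run on a separate CUDA stream that the compute stream waits on — that's the "aggressive overlap" from point 4.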

The result they report: training a GPT-3 scale model (175B parameters) on a single A100 80GB. Throughput is far lower than a distributed cluster's — nobody's claiming it's fast. But it works, and in full precision.

Why this isn't "just another memory optimization technique"

This is the part that made me get up and walk around.

There's an implicit frame through which everyone thinks about large LLM training: it's an enterprise infrastructure problem. Google does it, Meta does it, Anthropic does it. You and I use the models they publish, or we fine-tune with LoRA on smaller models. Training a large base model from scratch is off the map of what a single person can do.

MegaTrain doesn't give you cluster speed. But it gives you access. And that's a categorically different thing.

Think about it this way: the difference between training a 100B model in 30 days on a single GPU and never being able to do it at all is infinite. The difference between 30 days on one GPU and 3 days on a cluster is just a 10x factor. Those are gaps of entirely different kinds.

Who actually benefits from this?

  • Researchers without access to corporate compute: A university might have one or two H100s. With MegaTrain, that's enough to do real science at real scale.
  • Small companies that want their own models: Not everyone needs fast training. If you're training a model every six months on proprietary data, 30 days of compute on one GPU is a reasonable cost.
  • Experimentation before scaling: Validating that an architecture actually works before committing the cluster budget.

None of this is my immediate situation. But the direction matters.

It's the same feeling I had when the first LoRA papers dropped: in the moment, I had no concrete use for it, but I understood that something had moved. Two years later, accessible fine-tuning is the daily bread of the entire community. I find myself wondering whether MegaTrain — or its descendants — will be that same inflection point.

The real gotchas the paper doesn't put in the title

A moment of honesty: the paper is impressive, but there are things you have to read between the lines.

Speed is the elephant in the room. Token throughput is drastically lower than conventional distributed training. The paper acknowledges this — it doesn't hide it — but it also doesn't put the numbers in the headline. If you need to iterate fast, this isn't for you.

CPU-GPU bandwidth is the real bottleneck. PCIe 4.0 x16 has ~32 GB/s of theoretical bandwidth. In practice, parameter streaming is going to saturate that bus. GPUs with NVLink or unified memory architectures (like Apple's M2 Ultra) completely change this equation — something the paper mentions as future work.
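Some napkin math makes the bus cost concrete. The layer count and byte counts here are my assumptions (GPT-3-style 96 layers, BF16 weights), not numbers from the paper:

```python
def stream_time_s(layer_gb: float, bw_gbps: float = 32.0) -> float:
    """Seconds to stream one layer's weights over the bus at bw_gbps GB/s."""
    return layer_gb / bw_gbps

# A 175B model in BF16 is ~350 GB of weights; spread over ~96 layers
# that's roughly 3.6 GB per layer.
per_layer_gb = 350 / 96
print(round(stream_time_s(per_layer_gb), 3))  # → 0.114
```

~114ms per layer, per direction, at *theoretical* PCIe 4.0 x16 bandwidth — and the backward pass has to pull everything a second time. If the prefetch can't hide behind compute, that's where all the slowdown lives.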

Abundant CPU RAM is non-negotiable. If the model has 400GB of parameters plus another 800GB of optimizer state, you need all of that in CPU RAM. A workstation with over a terabyte of RAM isn't cheap. It's not a $50k cluster, but it's also not your home desktop.

Checkpointing and crash recovery get complicated. With state distributed between CPU and GPU in non-conventional ways, saving and recovering training state requires extra work.

# Approximate requirements for training a 100B model with MegaTrain
# (estimates based on the paper, not official production numbers)

# Model parameters in FP32: 100B * 4 bytes = 400 GB
# Optimizer states (AdamW, momentum + variance): 100B * 8 bytes = 800 GB
# Estimated total CPU RAM needed: ~1.2 TB
# Active GPU VRAM (current layer + buffers only): ~40-60 GB

# How much RAM does your server have?
free -h
# If you see less than 512GB, the full 100B scenario doesn't apply to you
# But it does apply for smaller models (13B, 30B) on more accessible hardware

This doesn't invalidate the technique — it reframes it. It's not "anyone can train GPT-4." It's "high-end hardware short of a datacenter cluster is now enough for a scale that was previously impossible."

That's an important distinction.

How this connects to the stack I use every day

Reality check: I'm not going to train a 100B LLM tomorrow. I work with Next.js, Docker, PostgreSQL, Railway — like when I looked at my own codebase with Google Maps and got a little scared. Base model training isn't my job.

But there's a conversation this paper changes for me as an application developer:

The "we don't have the compute for that" argument gets weaker. When I'm designing systems that use LLMs — or talking with clients about what's actually possible — the map of what requires Google-scale infrastructure and what doesn't is shifting. Fast.

I've lived this before at other layers of the stack. When I built the SSL certificate viewer for VS Code or the HAProxy extension, the logic was: why do I need to leave the editor for this? Democratization of tools that used to require specialized setup.

MegaTrain is that same logic applied at a much larger scale. The question "do I need a cluster to train this?" is going to have more negative answers in the coming years.

I also think about ML accessibility, not just software accessibility — and I've already written about how accessibility scores can lie to you in ways that actually matter. The "accessibility" of model training has the same problem: the reference metrics (clusters, cost, time) don't capture what really matters for different use cases.

Vibe-coding with AI already changed how I work. The question is what happens when the AI tools themselves — including training the models that power them — follow the same democratization path.

FAQ: MegaTrain and single-GPU training

Is MegaTrain open source and can I use it today?
The paper has been published, but at the time of writing the full code isn't publicly available in a production-ready state. The concepts are implementable — several people in the community are already experimenting with their own implementations based on the paper. Follow the authors on arXiv and GitHub for updates.

Does it work with any GPU or do I need an H100?
Technically it works with any modern CUDA GPU, but PCIe bandwidth and available VRAM limit what model size is actually practical. On an RTX 4090 (24GB VRAM) you can work with models significantly smaller than 100B. The H100 with 80GB VRAM gives more headroom for layer buffering. GPUs with unified memory like the Apple Silicon series are an interesting case that the paper flags as a future direction.

Is it comparable in speed to training on a GPU cluster?
No. Let's be clear: training throughput is drastically slower. If a cluster of 64 A100s trains a model in a week, MegaTrain on a single GPU could take months. The value isn't in speed — it's in access. Being able to do something that was previously impossible without a cluster, even if it's slow.

Does this replace LoRA or QLoRA for fine-tuning?
They're tools for different problems. LoRA and QLoRA are for efficient fine-tuning of existing pre-trained models — and they're still the right answer for that use case. MegaTrain is for base training (pre-training) or full parameter training from scratch. If you want to adapt Llama 3 to your domain, LoRA is still the answer. If you want to train a model from zero on your own data, MegaTrain opens doors.

How much CPU RAM do I realistically need?
Depends on model size. The rough rule: parameters in FP32 (4 bytes per param) + AdamW optimizer states (8 additional bytes per param) + overhead. For a 13B parameter model: ~13B * 12 bytes ≈ 156GB CPU RAM. For 70B: ~840GB. For 100B: ~1.2TB. This makes CPU RAM the actual access bottleneck more than the GPU in many cases.
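That rule of thumb fits in a one-line helper (mine, matching the estimates above — overhead not included):

```python
def cpu_ram_gb(n_params_b: float) -> float:
    """FP32 params (4 B) + AdamW momentum/variance (8 B) per parameter, in GB."""
    return n_params_b * 12  # billions of params * 12 bytes; the 1e9s cancel out

for size in (13, 70, 100):
    print(f"{size}B → ~{cpu_ram_gb(size):.0f} GB CPU RAM")
```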

Does this have implications for privacy and proprietary data?
Yes, and it's a big one. One of the real frictions of training models on sensitive data is that you need cloud infrastructure, which means your data leaves your network. If MegaTrain makes training viable on on-premise hardware without a cluster, the business case for models trained on proprietary data in a controlled environment gets considerably stronger. For sectors like healthcare, finance, or legal — this isn't a minor footnote.

What I'm taking away from this

I won't use MegaTrain next week. Probably not next year with my current stack either.

But some papers don't give you a tool — they change how you map what's possible. This is one of those. The first time I saw a language model run on CPU, I thought "interesting but useless." Two years later, that's the foundation of how millions of people run local LLMs.

Hardware has stopped being the definitive excuse for not doing real science at real scale. That has consequences that will take time to unfold, but the direction is clear.

For me, having to get up and walk around when I read it is enough. That doesn't happen often.

If you want to read the original paper, search arXiv for "MegaTrain full precision training." It's worth your time.

Have you already read it? Or do you have an implementation running? I'd genuinely like to know what you found.
