DEV Community

TechPulse Lab

Posted on • Originally published at techpulselab.com

Nvidia GreenBoost Lets You Fake More VRAM — And It Actually Kind of Works

There's a project sitting at the top of Hacker News right now with 277 points and climbing. It's called Nvidia GreenBoost, and it does something NVIDIA would really rather you not think about: it transparently extends your GPU's VRAM by borrowing system RAM and NVMe storage.

A lone developer on GitLab built a CUDA shim — a thin layer that sits between your applications and the GPU driver — that makes your system RAM appear as additional VRAM. Your 8GB RTX 4060? GreenBoost can make it pretend it has 32GB. Your 12GB RTX 4070? Now it thinks it's got 64GB.

And here's the part that's making people lose their minds: it actually kind of works.

How It Works (And Why It Shouldn't Be This Easy)

GreenBoost operates as a CUDA interposer — essentially a man-in-the-middle between your CUDA applications and NVIDIA's driver stack. When a program requests GPU memory allocation, GreenBoost intercepts the call. If there's real VRAM available, it allocates there as normal. When VRAM runs out, it transparently redirects the overflow to system RAM, and optionally to NVMe-backed swap.

The application never knows the difference. As far as your AI inference engine or 3D renderer is concerned, the GPU has as much memory as you've configured GreenBoost to expose.
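To make the mechanism concrete, here's a toy sketch of that placement policy in Python. This is not GreenBoost's actual code — the real shim is a C-level library that intercepts `cudaMalloc` (typically via `LD_PRELOAD`) — and `FakeGpu`, `place`, and the byte constants are invented for illustration:

```python
# Toy model of the interposer's placement decision. A real shim forwards
# the request to the driver; this only models where each allocation lands.
GIB = 1 << 30
VRAM_BYTES = 8 * GIB        # pretend the card has 8 GB of VRAM
SPILL_BYTES = 24 * GIB      # system-RAM overflow the shim is configured to expose

class FakeGpu:
    def __init__(self):
        self.vram_used = 0
        self.sysram_used = 0

    def place(self, nbytes):
        """Return which tier a cudaMalloc-sized request would land in."""
        if self.vram_used + nbytes <= VRAM_BYTES:
            self.vram_used += nbytes
            return "vram"        # real shim: pass through to cudaMalloc
        if self.sysram_used + nbytes <= SPILL_BYTES:
            self.sysram_used += nbytes
            return "sysram"      # real shim: pinned host memory reached over PCIe
        raise MemoryError("out of configured memory")

gpu = FakeGpu()
for i in range(1, 11):           # ten 1 GiB requests against an "8 GB" card
    print(f"alloc {i:2d} GiB -> {gpu.place(GIB)}")
# requests 1-8 land in "vram"; requests 9 and 10 spill to "sysram"
```

The application-facing behavior is the point: every call succeeds, and nothing upstream can tell which tier it got.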

This is not a new concept. Virtual memory has been doing this for CPUs since the 1960s. What's remarkable is that NVIDIA hasn't shipped it as a transparent driver feature. The hardware supports PCIe-based memory access, and CUDA's managed memory (cudaMallocManaged) can already oversubscribe VRAM — but only for applications that opt in. For the plain cudaMalloc path that most software actually uses, NVIDIA simply... chose not to.

The cynical read is obvious: if your 8GB card could seamlessly use 32GB of system RAM as overflow, why would you buy a card with more VRAM?

The Performance Reality Check

Let's not pretend this is magic. System RAM is fundamentally slower than VRAM. A modern GPU like the RTX 4090 has about 1 TB/s of memory bandwidth to its onboard GDDR6X. Your DDR5 system RAM? Maybe 50-80 GB/s. PCIe 5.0 x16 tops out around 64 GB/s.
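A quick back-of-the-envelope makes the gap concrete. The bandwidth figures come from the paragraph above; the 4 GiB chunk size is an arbitrary example, not a GreenBoost parameter:

```python
# Time to move a 4 GB chunk of weights at each tier's rough bandwidth.
SIZE_GB = 4
TIERS = [
    ("GDDR6X on an RTX 4090", 1000),  # ~1 TB/s onboard
    ("PCIe 5.0 x16 link",       64),  # ~64 GB/s
]
for name, gbps in TIERS:
    print(f"{name}: {SIZE_GB / gbps * 1000:.1f} ms")
# GDDR6X on an RTX 4090: 4.0 ms
# PCIe 5.0 x16 link: 62.5 ms
```

A 15x slowdown on every byte that crosses the bus — and that's before DDR5 itself becomes the bottleneck.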

So when GreenBoost spills data to system RAM, anything that touches that data takes a massive hit. The numbers paint an honest picture:

  • Baseline (Ollama, pure system RAM overflow): 2-5 tokens per second. Usable for testing. Painful for actual work.
  • Optimized path (ExLlamaV3 + GreenBoost cache): 8-20 tokens per second. Now we're talking.

The key insight: most large language models don't access all their parameters equally. During inference, attention layers get hammered constantly, but many weight matrices are accessed infrequently. If you can keep the hot path in real VRAM and let cold overflow sit in system RAM, the performance penalty only materializes on rare cache misses.
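The arithmetic behind that insight is just an expected-value calculation: if almost every access hits real VRAM, the rare spill barely moves the average. The cost numbers below are made up purely to show the shape of the curve:

```python
# Average access cost under a hot/cold split (relative costs illustrative only).
VRAM_COST = 1.0      # cost of touching data resident in VRAM
SPILL_COST = 20.0    # cost of pulling spilled data back over PCIe

def effective_cost(hit_rate):
    """Expected cost per access when hit_rate of touches stay in VRAM."""
    return hit_rate * VRAM_COST + (1 - hit_rate) * SPILL_COST

for hr in (0.50, 0.90, 0.99):
    print(f"hit rate {hr:.0%}: {effective_cost(hr):.2f}x VRAM cost")
# hit rate 50%: 10.50x VRAM cost
# hit rate 90%: 2.90x VRAM cost
# hit rate 99%: 1.19x VRAM cost
```

That curve is why the ExLlamaV3 path is so much faster than naive overflow: it's not reducing the cost of a miss, it's driving the miss rate toward zero.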

The "almost fits" scenario is GreenBoost's killer use case. For a model that's just 2-3GB too large for your GPU's VRAM, GreenBoost automates the overflow — and when the overflow is small relative to the model's total footprint, it works remarkably well.

For Gaming? Not So Much

Gaming workloads are fundamentally different from AI inference. Games access textures in unpredictable, spatially-dependent patterns. There's no "hot path" to keep in VRAM — everything is potentially hot.

The result is constant thrashing between VRAM and system RAM, and the latency kills frame times. Even if average FPS looks okay, the frame-time spikes create nauseating stuttering.

AMD Already Does This. With a Kernel Parameter.

On AMD GPUs, you can already extend VRAM into system RAM with amdttm.pages_limit. Set it, reboot, done. AMD just... lets you do it.
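For reference, the AMD route looks roughly like this. Hedged heavily: the parameter takes a page count rather than a byte count, the example value is mine (not from AMD's docs), and on a mainline kernel the module may be named `ttm` rather than `amdttm` — check your distro before copying:

```shell
# Append to the kernel command line (e.g. GRUB_CMDLINE_LINUX in
# /etc/default/grub), then regenerate the grub config and reboot.
# pages_limit is counted in 4 KiB pages: 8388608 pages = 32 GiB of
# system RAM made available to the GPU as GTT.
amdttm.pages_limit=8388608
```

One boot parameter, no shim, no interposer. That asymmetry is the whole point of the next section.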

AMD sees VRAM extension as a feature. NVIDIA sees it as a threat to their upsell strategy.

The Bigger Picture

GreenBoost matters not because of what it is — a clever memory trick — but because of what it represents. The GPU computing community is tired of artificial limitations. They're tired of being told that 8GB of VRAM is "enough" by a company that charges $2000 for 32GB.

Open source keeps finding the gaps between what hardware can do and what vendors let it do, and it keeps filling those gaps with code. GreenBoost sits somewhere between clever enough to work and rough enough to remind you it shouldn't have been necessary.

NVIDIA could ship this as a driver feature tomorrow. They have the engineers, the telemetry data, and the driver-level access that a userspace shim will never have. They won't do it. Because NVIDIA's business model depends on VRAM being a hard constraint.

Should You Use GreenBoost?

  • Running local AI inference with a model slightly too large for your VRAM? Yes. Clone the repo, build the shim, pair with ExLlamaV3 for best results.
  • Gaming? Skip it. Wait for unified memory architectures.
  • On AMD? Just use amdttm.pages_limit. You don't need a third-party shim.

Originally published on TechPulse Daily
