Docker Model Runner just got a major upgrade for Mac users. With the introduction of vllm-metal, a new backend that brings vLLM inference to macOS via Apple Silicon's Metal GPU, you can now run MLX models through the same OpenAI-compatible API, the same Claude Code-compatible API, and the same Docker workflow you already know.
I put this to the test by running NVIDIA Nemotron models on my Mac. Here's what the experience looks like, what works today, and what's coming very soon.
What is vllm-metal?
vllm-metal is a plugin for vLLM that brings high-performance LLM inference to Apple Silicon. Developed by Docker engineers and now contributed to the open-source vLLM community, it unifies MLX (Apple's machine learning framework) and PyTorch under a single compute pathway — plugging directly into vLLM's existing engine, scheduler, and OpenAI-compatible API server.
The architecture is elegant:
+-------------------------------------------------------------+
|                          vLLM Core                          |
|            Engine | Scheduler | API | Tokenizers            |
+-------------------------------------------------------------+
                               |
+-------------------------------------------------------------+
|                   vllm_metal Plugin Layer                   |
|          | Platform |  | Worker |  | ModelRunner |          |
+-------------------------------------------------------------+
                               |
+-------------------------------------------------------------+
|                   Unified Compute Backend                   |
|  | MLX (Primary inference) |  | PyTorch (model loading) |   |
+-------------------------------------------------------------+
                               |
+-------------------------------------------------------------+
|          Metal GPU / Apple Silicon Unified Memory           |
+-------------------------------------------------------------+
Source: https://www.docker.com/blog/docker-model-runner-vllm-metal-macos/
What makes this particularly powerful on Apple Silicon is unified memory. Unlike discrete GPUs where data must be copied between CPU and GPU memory, Apple Silicon shares a single memory pool. vllm-metal exploits this with zero-copy tensor operations — combined with paged attention for KV cache management and Grouped-Query Attention support.
Getting Started
Update to Docker Desktop for Mac 4.62 or later, then install the backend:
docker model install-runner --backend vllm-metal
That's it. Docker Model Runner automatically routes MLX models to vllm-metal when the backend is installed.
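Once the backend is installed, you can exercise the OpenAI-compatible API directly from the host. A minimal sketch, assuming host-side TCP access to Docker Model Runner is enabled on its default port 12434 and the model has already been pulled (the base URL is an assumption — check your Model Runner settings if it differs):

```shell
# Request a chat completion from a local MLX model via the
# OpenAI-compatible endpoint (base URL assumed: default DMR TCP port).
curl -s http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'
```

The response follows the standard OpenAI chat-completions shape, so any client that speaks that API can consume it unchanged.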
vLLM Now Runs Everywhere with Docker Model Runner
This release completes vLLM support across all three major platforms:
| Platform | Backend | GPU |
|---|---|---|
| Linux | vllm | NVIDIA (CUDA) |
| Windows (WSL2) | vllm | NVIDIA (CUDA) |
| macOS | vllm-metal | Apple Silicon (Metal) |
The same docker model commands work regardless of platform. Docker Model Runner picks the right backend automatically.
Which Models Work with vllm-metal?
vllm-metal works with safetensors models in MLX format. The mlx-community on Hugging Face maintains a large collection of quantized models optimized for Apple Silicon. Some great starting points:
# Lightweight and fast
docker model run hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit
# Strong 7B
docker model run hf.co/mlx-community/Mistral-7B-Instruct-v0.3-4bit
# Latest coding model
docker model run hf.co/mlx-community/Qwen3-Coder-Next-4bit
Running NVIDIA Nemotron on Mac
With vllm-metal installed, let's look at what the Nemotron lineup offers on Mac today.
Llama-3.1-Nemotron-Nano-8B (Recommended)
This is a standard transformer model fine-tuned by NVIDIA for instruction following and reasoning. It runs well on Mac via vllm-metal:
docker model run hf.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1
On a 32GB or 64GB Mac, this model runs comfortably. On a 16GB Mac, free up unified memory first by reducing Docker Desktop's VM memory limit (Settings → Resources → Advanced → set to 4–5 GB).
Nemotron-3-Nano-30B: The Mamba2 Frontier
The 30B model uses a hybrid SSM (State Space Model) architecture — combining Mamba2 layers with attention layers. This is architecturally novel and represents the next frontier of efficient inference: better long-context performance, lower memory footprint at runtime, and compelling throughput characteristics.
Mamba2 support in mlx-lm is actively maturing, and as it does, models like Nemotron-3-Nano-30B will run natively through vllm-metal. This is an exciting space to watch — especially on higher-memory Macs where the 30B model size becomes practical.
The $599 AI Development Rig
One of the most compelling stories in this release: a base Mac Mini M4 at $599 is now a viable vLLM development environment. Because Apple Silicon uses unified memory, the 16GB (or upgraded 32GB/64GB) RAM is directly accessible by the GPU, enabling you to:
- Develop and test locally using the same OpenAI-compatible API as production
- Mirror production — the same API surface as an H100 cluster, on your desk
- Run efficiently — a fraction of the power consumption and heat of a discrete GPU rig
This democratizes access to vLLM development in a way that wasn't possible before. Previously, getting started with high-throughput vLLM required an RTX 4090 ($1,700+) or an enterprise-grade GPU card. Now, the barrier to entry is a Mac Mini.
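Mirroring production can be as simple as repointing environment variables. A sketch, assuming the Model Runner's OpenAI-compatible endpoint is reachable on the default port 12434; `your_app.py` is a placeholder for any application built on an OpenAI SDK:

```shell
# The official OpenAI SDKs read these environment variables, so existing
# client code can target the local Mac instead of a hosted GPU cluster.
export OPENAI_BASE_URL="http://localhost:12434/engines/v1"  # assumed DMR endpoint
export OPENAI_API_KEY="local"  # SDKs require a non-empty key; the local runner ignores it

# python your_app.py   # placeholder: run your unmodified OpenAI-SDK app
```

Because the API surface is the same, switching back to a cloud or cluster deployment is just a matter of changing these two variables.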
vllm-metal vs llama.cpp: Benchmark Context
Docker benchmarked both backends on Llama 3.2 1B Instruct with 4-bit quantization:
| max_tokens | llama.cpp (tok/s) | vllm-metal (tok/s) |
|---|---|---|
| 128 | 333.3 | 251.5 |
| 512 | 345.1 | 279.0 |
| 1024 | 338.5 | 275.4 |
| 2048 | 339.1 | 279.5 |
llama.cpp shows ~1.2–1.3x higher raw throughput on this benchmark. However, vllm-metal brings things that raw token speed doesn't capture: the full vLLM engine, including its scheduler, paged attention, batching, and production-grade OpenAI-compatible API. For developers building real applications, that consistency and API compatibility often matter more than peak throughput on a single request.
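One way to sanity-check these numbers on your own machine is to time a single request at a given `max_tokens` budget. A rough sketch — the endpoint base URL is an assumption, and single-request wall-clock time includes prompt processing, so this understates pure decode throughput:

```shell
# Time one 512-token completion, mirroring the max_tokens=512 benchmark row
# (endpoint assumed: Docker Model Runner's default TCP port).
time curl -s http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Write a short story."}],
        "max_tokens": 512
      }' > /dev/null
```

Dividing the response's `usage.completion_tokens` by the elapsed seconds gives an approximate tokens-per-second figure comparable to the table above.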
Hardware Recommendations
| Mac Configuration | What to Run |
|---|---|
| M-series, 16GB | mlx-community/Llama-3.2-1B-Instruct-4bit, ai/phi4-mini |
| M-series, 32GB | Nemotron 8B, mlx-community/Mistral-7B-Instruct-v0.3-4bit |
| M-series Max/Ultra, 64GB+ | Nemotron 8B comfortably, 30B models as SSM support matures |
I'm upgrading to a MacBook Pro Max this month — full benchmarks and multi-model Nemotron demos coming soon.
Open Source at the Core
Docker contributed vllm-metal to the vLLM open-source community — it now lives under the vLLM GitHub organization. This means every developer in the ecosystem can benefit from and contribute to high-performance inference on Apple Silicon. The project has also had significant contributions from Lik Xun Yuan, Ricky Chen, and Ranran Haoran Zhang.
Quick Start
# Install vllm-metal backend (Docker Desktop 4.62+ required)
docker model install-runner --backend vllm-metal
# Run a Nemotron model
docker model run hf.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1
# Or try an mlx-community model
docker model run hf.co/mlx-community/Mistral-7B-Instruct-v0.3-4bit
# List all downloaded models
docker model ls
Learn More
- Docker Model Runner documentation
- docker/model-runner on GitHub
- mlx-community on Hugging Face
- vllm-metal on GitHub
Want to learn more? Join me virtually at the Collabnix Community Meet and NVIDIA GTC 2026 Watch Party on March 21st. Don't forget to register here.