This article was originally published on Best GPU for LLM. The full version with interactive tools, FAQ, and live pricing is on the original site.
LM Studio is one of the most hardware-aware LLM frontends available. Unlike tools that run the same inference backend regardless of platform, LM Studio selects its backend based on what hardware it detects: MLX on Apple Silicon, CUDA on NVIDIA, and Metal as an Intel Mac fallback. This means an M4 Pro Mac running LM Studio gets meaningfully better performance than the same hardware running a tool that defaults to llama.cpp's CPU path.
That backend selection decision is what this guide is built around.
Quick answer: For NVIDIA desktop builds, the RTX 4090 (24GB) handles 34B models smoothly and the RTX 4060 Ti 16GB is the budget entry point for 13B at full quality. For Apple Silicon, the M4 Pro 24GB is the minimum for comfortable 13B use, and M4 Max 48GB+ handles 34B. The used RTX 3090 (24GB) remains the strongest VRAM-per-dollar option if you find one at a good price.
See the recommended pick on the original guide
How LM Studio picks its backend
This matters because it directly affects performance, and it's what separates LM Studio from other local inference tools.
Apple Silicon: LM Studio defaults to MLX, Apple's machine learning framework built for M-series chips. MLX makes efficient use of the unified memory architecture — the same memory pool serves both CPU and GPU, so a MacBook Pro M4 Max with 48GB can devote most of that memory to the model, with no VRAM ceiling separate from system RAM. MLX inference on Apple Silicon is significantly faster than llama.cpp's CPU path, and in many cases faster than GPU-offloaded llama.cpp as well.
Before LM Studio made MLX the default on Apple Silicon, tools like earlier versions of Ollama defaulted to llama.cpp, which ran on the CPU unless explicitly configured for GPU offloading. LM Studio's automatic MLX backend is why many Mac users saw their LLM performance change overnight by switching frontends, not hardware.
NVIDIA GPUs: LM Studio uses a CUDA-accelerated llama.cpp backend, giving full GPU acceleration with VRAM management, quantization selection, and layer splitting across GPU and CPU if a model doesn't fit.
Intel Mac / no supported GPU: Falls back to Metal or CPU inference via llama.cpp. Functional but significantly slower — not a recommended primary platform for LLM inference.
VRAM requirements by model size in LM Studio
LM Studio's quantization selector makes VRAM requirements variable. Here's a practical guide to what fits where:
| Model size | Q4 quantization | Q8 quantization | Full precision (FP16) |
|---|---|---|---|
| 7B | ~4.5GB | ~8GB | ~14GB |
| 13B | ~7.5GB | ~14GB | ~26GB |
| 34B | ~20GB | ~35GB | ~68GB |
| 70B | ~40GB | ~70GB | ~140GB |
For LM Studio on NVIDIA: if a model's quantized size fits in VRAM, it runs fully on GPU. If it doesn't fit, LM Studio can split layers across GPU and CPU — but layers running on CPU are dramatically slower. The practical target is fitting the entire model in VRAM for acceptable generation speed.
For Apple Silicon: unified memory means the 7B Q4 / 13B Q4 / 34B Q4 question is just about total system memory, not a separate VRAM limit. This is the architectural advantage.
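As a rough pre-download sanity check, you can approximate a quantized model's footprint from its parameter count and bits per weight, then compare it against your card. The sketch below uses ballpark bits-per-weight values, a 10% overhead factor, and a flat KV-cache allowance as assumptions; it is not the exact check LM Studio performs before loading.

```python
# Rough VRAM-fit estimate: parameter count times bits per weight, plus ~10%
# overhead for runtime buffers and a flat KV-cache allowance. The constants
# are ballpark assumptions, not LM Studio's internal math.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q6": 6.5, "Q8": 8.5, "FP16": 16.0}

def model_size_gb(params_billion: float, quant: str) -> float:
    weight_bytes = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return weight_bytes * 1.1 / 1e9  # ~10% overhead for buffers

def fits_in_vram(params_billion: float, quant: str, vram_gb: float,
                 kv_cache_gb: float = 2.0) -> bool:
    return model_size_gb(params_billion, quant) + kv_cache_gb <= vram_gb

print(round(model_size_gb(7, "Q4"), 1))    # ~4.3 GB, in line with the table's ~4.5GB
print(fits_in_vram(34, "Q4", vram_gb=24))  # True: 34B Q4 fits a 24GB card
print(fits_in_vram(34, "Q8", vram_gb=24))  # False: 34B Q8 needs a 48GB Mac or a workstation card
```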
For more on VRAM sizing principles, see how much VRAM do you need for local LLM.
NVIDIA picks for LM Studio
RTX 4090 (24GB) — best NVIDIA option:
24GB handles 13B models at Q8 or FP16, 34B models at Q4 and Q5, and provides fast generation on 7B models. LM Studio's CUDA path with 24GB means no model splitting on mainstream LLMs in 2026 — everything runs fully on GPU at comfortable speeds. Community users report 25–40 tokens/second for 13B Q4 on RTX 4090, which is fast enough for productive use.
See the recommended pick on the original guide
RTX 4060 Ti 16GB — best budget 13B card:
16GB is the sweet spot for 13B model users. The RTX 4060 Ti 16GB at around $400 fits 13B Q8 (14GB) with margin, and handles 34B Q4 (20GB) with minor layer splitting. For users primarily running 7B and 13B models, this card handles LM Studio workloads well. Generation speed is slower than the 4090 due to lower bandwidth (288 GB/s vs 1,008 GB/s), but fully functional. See best GPU for 13B models for a detailed comparison.
See the recommended pick on the original guide
Used RTX 3090 (24GB) — best VRAM-per-dollar:
If you're willing to buy used, the RTX 3090 offers 24GB GDDR6X — the same VRAM capacity as the RTX 4090 — at significantly lower prices on the secondhand market. Generation speed is noticeably slower than the 4090 (lower memory bandwidth), but for users whose bottleneck is VRAM capacity rather than raw throughput, the 3090 gives 34B model compatibility at a fraction of 4090 pricing. LM Studio runs cleanly on RTX 3090 with full CUDA support.
Apple Silicon picks for LM Studio
The MLX backend makes Apple Silicon uniquely competitive for LLM inference in LM Studio. The math is straightforward: unified memory means no separate VRAM ceiling, and MLX performance on M-series chips is fast enough that M-series Macs can outperform lower-VRAM NVIDIA cards for certain model sizes.
M4 Pro 24GB — minimum for 13B:
The M4 Pro with 24GB unified memory handles 13B Q8 comfortably and 34B Q4 at usable speeds. 24GB is the practical minimum for productive 13B work — 16GB of unified memory is enough for 7B but cramped for 13B Q8. LM Studio's MLX path on M4 Pro gives smooth generation that would require an RTX 4060 Ti or better on the NVIDIA side. Community comparisons put M4 Pro 24GB roughly equivalent to an RTX 4070 for 13B inference through LM Studio.
M4 Max 48GB+ — for 34B models:
48GB unified memory handles 34B Q8 and is the entry point for comfortable 34B use. M4 Max with 48GB sits in a unique position: no NVIDIA consumer card reaches 48GB VRAM. The RTX 4090 maxes out at 24GB; fitting a 34B Q8 model (35GB) requires either a Mac or a workstation-class card. For users who want 34B models at full quality without workstation GPU pricing, M4 Max 48GB is the most accessible option.
M3 Ultra / M4 Ultra 192GB — for 70B+ models:
Ultra-class chips with 192GB unified memory can run 70B models at Q8 and 34B at full precision — configurations that aren't possible on any consumer NVIDIA GPU. LM Studio's MLX backend exploits this fully. For users who need 70B-class performance locally without a multi-GPU server setup, the M3 or M4 Ultra is the only consumer-accessible path. The price is workstation-level, but the capability is genuine.
For a full head-to-head comparison of these platforms, see Mac vs NVIDIA for LLM.
Which GPU for LM Studio?
- You run 7B models, budget build: RTX 3060 12GB or RTX 4060 8GB. Both run 7B Q4 fully in VRAM, and the 12GB card fits 7B Q8 with headroom. Not comfortable for 13B.
- You run 7B–13B models, NVIDIA desktop: RTX 4060 Ti 16GB (~$400) is the right call — 16GB fits 13B Q8, every 7B fits easily.
- You run 34B models, NVIDIA: RTX 4090 (24GB) or used RTX 3090 (24GB). 24GB fits 34B Q4/Q5 fully in VRAM.
- You're on Apple Silicon, running 13B: M4 Pro 24GB minimum. 16GB is workable but cramped.
- You're on Apple Silicon, running 34B: M4 Max 48GB+. This is the only accessible path to 34B Q8 on a single consumer device.
- You run 70B models: M3/M4 Ultra (192GB) or multi-GPU NVIDIA setup. No single consumer NVIDIA card handles 70B on its own.
- You want to explore models without committing: LM Studio's model browser and built-in chat interface make it ideal for this. Use LM Studio for exploration, then move to Ollama for production automation.
Why LM Studio is worth using even on NVIDIA
Many GPU buyers default to Ollama because of its automation and API support. That's a valid workflow — but LM Studio offers something distinct that makes it worth running alongside Ollama:
Model browser: LM Studio has a built-in model discovery interface connected to HuggingFace. You can browse, filter by size and quantization, and download directly. No manual HuggingFace navigation or CLI commands.
Built-in chat interface: A polished chat UI with conversation history, system prompt editing, and context length controls. More comfortable for interactive use than Ollama's stock interface.
Quantization comparison: LM Studio makes it easy to test the same model at Q4, Q5, Q6, and Q8 side-by-side and assess quality vs speed trade-offs with your actual VRAM. This is valuable during the exploration phase when you're deciding what model to run long-term.
LM Studio as exploration, Ollama for production: The common pattern among experienced local LLM users is to explore new models in LM Studio, find the quantizations that work well on their hardware, then set up the same model in Ollama for API-accessible, automation-friendly production use. LM Studio's local server exposes an OpenAI-compatible API, so client code written against it can later point at Ollama's OpenAI-compatible endpoint with little more than a base URL change. See best GPU for Ollama for Ollama-specific guidance, and best GPU for Open WebUI if you plan to put a browser chat interface in front of that Ollama backend.
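To make that bridge concrete, here is a minimal sketch of a client that works against either server. It assumes LM Studio's local server on its default port (1234) and Ollama's OpenAI-compatible endpoint on its default port (11434); the model names are placeholders, so substitute whatever you actually have loaded.

```python
# Minimal sketch: the same OpenAI-style request works against LM Studio's local
# server (default http://localhost:1234/v1) and Ollama's OpenAI-compatible
# endpoint (default http://localhost:11434/v1). Model names are placeholders.
import requests

def chat(base_url: str, model: str, prompt: str) -> str:
    resp = requests.post(
        f"{base_url}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Exploration against LM Studio, production against Ollama: same client code.
print(chat("http://localhost:1234/v1", "your-lmstudio-model-id", "Say hello."))
# print(chat("http://localhost:11434/v1", "your-ollama-model-tag", "Say hello."))
```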
LM Studio system requirements
LM Studio's official documentation notes that CUDA 11.8+ is required for NVIDIA GPU acceleration on Windows and Linux. Apple Silicon requires macOS 13.6+ for MLX support. For optimal MLX performance on Mac, running the latest available macOS version is recommended as Apple ships MLX optimizations through OS updates.
GPU memory requirements are model-dependent — LM Studio displays available VRAM and flags whether your selected model fits before loading, which makes it more user-friendly than tools that discover VRAM limits at runtime.
See the recommended pick on the original guide
For broader LLM hardware context, see how much VRAM for local LLM and best GPU for Llama 4.
Frequently Asked Questions
What are LM Studio's GPU requirements?
LM Studio requires CUDA 11.8 or newer for NVIDIA GPU acceleration on Windows and Linux. Any NVIDIA GPU with 8GB+ VRAM can run 7B models. For Apple Silicon, macOS 13.6+ is required for MLX support. LM Studio displays whether your GPU has enough VRAM before loading a model, so you can check compatibility before downloading.
Does LM Studio support multiple GPUs?
LM Studio can split model layers across multiple NVIDIA GPUs when a single card does not have enough VRAM. However, multi-GPU support is not as seamless as single-GPU use — you may need to manually configure layer allocation, and inter-GPU communication adds some overhead. For most users, a single high-VRAM card like the RTX 4090 is simpler and often faster than two smaller cards.
How much VRAM does LM Studio need?
VRAM needs depend on the model size and quantization level. For 7B models at Q4, you need about 6GB. For 13B models at Q4, about 10GB. For 34B models at Q4, about 22GB. LM Studio also uses VRAM for the KV cache during conversations, so budget an extra 2-4GB beyond the base model size for comfortable context lengths.
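Most of that extra budget is the KV cache, which grows linearly with context length. A back-of-the-envelope sketch of that arithmetic is below; the layer count, KV head count, and head dimension are typical assumptions for a 13B-class model with grouped-query attention, not values LM Studio reports.

```python
# KV cache bytes ≈ 2 (K and V) * layers * kv_heads * head_dim * context_len * bytes_per_elem.
# The architecture numbers are illustrative assumptions for a 13B-class GQA model,
# not values read from any specific GGUF file.
def kv_cache_gb(n_layers=40, n_kv_heads=8, head_dim=128,
                context_len=8192, bytes_per_elem=2):
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1e9

print(round(kv_cache_gb(), 1))                   # ~1.3 GB at an 8K context
print(round(kv_cache_gb(context_len=32768), 1))  # ~5.4 GB at 32K, which is why long contexts hurt
```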
Does LM Studio work on Apple Silicon with MLX?
Yes, and it is one of LM Studio's biggest advantages. LM Studio automatically selects the MLX backend on Apple Silicon Macs, which uses unified memory efficiently. An M4 Pro with 24GB handles 13B models well, and an M4 Max with 48GB runs 34B models comfortably. MLX performance on Apple Silicon often matches or exceeds mid-range NVIDIA GPUs for equivalent model sizes.
Related guides on Best GPU for LLM
- Best Budget GPU for Local LLM in 2026 (Under $350)
- Best GPU for Continue.dev (Local AI Coding) in 2026
- Best GPU for Gemma 2B-27B in 2026 (6 Picks Ranked)
The full version lives on Best GPU for LLM — VRAM calculator, GPU comparison table, and live Amazon pricing.