Preface
Apple Silicon has rapidly emerged as a major platform for machine learning development and deployment. With unified memory architectures supporting up to 192 GB of shared CPU and GPU memory and memory bandwidth exceeding 400 GB/s, recent Mac devices provide substantial capability for running large language models locally. This has increased interest in efficient inference frameworks tailored specifically to Apple hardware, particularly for development workflows, privacy-sensitive applications, and edge deployment.
Existing inference solutions, however, present structural limitations. PyTorch’s MPS backend adapts CUDA-style execution to Metal but does not fully exploit the unified memory model. llama.cpp delivers strong performance for text-only models yet lacks support for multimodal architectures. vLLM-metal provides continuous batching but does not include multimodal execution or vision caching support. As a result, the current ecosystem remains fragmented.
Recent advances in MLX introduce a native execution framework specifically designed for Apple Silicon. By leveraging unified memory and Metal acceleration directly, MLX enables zero-copy tensor operations and optimized kernel execution. Models compiled with MLX demonstrate significantly higher throughput compared to standard PyTorch or GGUF-based builds.
On M-series systems, MLX-optimized builds of Qwen 3.5 achieve approximately 2x token generation speed relative to baseline implementations on identical hardware.
This article demonstrates how to deploy Qwen 3.5 using an MLX build through LM Studio, providing a simplified interface for high-performance local inference on Apple Silicon.

MLX Outperforms llama.cpp
Benchmark results show vllm-mlx consistently exceeding llama.cpp throughput by 21% to 87%. We attribute this to three factors:
- MLX's native unified memory design enables zero-copy tensor operations, avoiding the memory-transfer overhead present in llama.cpp's Metal backend.
- MLX's lazy evaluation allows operation fusion and reduces kernel launch overhead.
- Our continuous batching scheduler maximizes GPU utilization by processing multiple sequences simultaneously.
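The continuous batching idea in the last point can be illustrated with a toy sketch. This is an assumption-level illustration, not vllm-mlx's actual scheduler: sequences needing different numbers of tokens are stepped together, and as soon as one finishes, a waiting request is swapped into its slot so the batch stays full.

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """requests: list of (name, tokens_needed). Returns completion order."""
    waiting = deque(requests)
    active = {}      # name -> tokens still to generate
    finished = []
    while waiting or active:
        # Refill the batch as soon as a slot frees up (the key idea:
        # no waiting for the whole batch to finish).
        while waiting and len(active) < max_batch:
            name, need = waiting.popleft()
            active[name] = need
        # One "decode step": every active sequence emits one token.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]
                finished.append(name)
    return finished

# "b" needs only 1 token, so it finishes first and "c" takes its slot
# while "a" is still generating.
print(continuous_batching([("a", 3), ("b", 1), ("c", 2)]))  # ['b', 'a', 'c']
```

A static batcher would have made "c" wait until both "a" and "b" were fully done; the refill loop is what keeps utilization high.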
Installation
LM Studio provides a graphical interface for downloading and running local models without manual configuration.
- Enter Qwen 3.5 MLX in the search field.
- Select a model size (0.8B, 2B, 4B, or 9B) explicitly labeled as MLX. By default, LM Studio indicates which models your device is capable of running.
Alternatives
This method uses MLX to run Qwen 3.5 models directly on Apple Silicon hardware. MLX is designed specifically for M-series chips and provides GPU acceleration through the Metal backend while fully exploiting unified CPU/GPU memory. This eliminates explicit memory transfers and improves execution efficiency.
MLX supports all Apple Silicon devices (M1, M2, M3, and later).
Installing MLX on macOS
Inference support for Qwen models is provided via the mlx-lm Python package.
Requirements:
- macOS running on Apple Silicon
- Python 3.11 or newer
# Create a clean virtual environment (recommended)
python3 -m venv mlx-qwen
source mlx-qwen/bin/activate
# Install the latest mlx-lm (handles text-only Qwen 3.5 models)
pip install mlx-lm
# Optional: for vision-language Qwen3.5 models
pip install mlx-vlm
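Before installing, it can help to confirm the requirements above are met. The helper below is hypothetical (not part of mlx-lm); it simply checks that Python reports macOS on an arm64 machine and a new enough interpreter.

```python
import platform
import sys

def is_apple_silicon(system: str, machine: str) -> bool:
    """True for macOS running natively on an M-series (arm64) chip."""
    return system == "Darwin" and machine == "arm64"

if __name__ == "__main__":
    hw_ok = is_apple_silicon(platform.system(), platform.machine())
    py_ok = sys.version_info >= (3, 11)
    print(f"Apple Silicon: {hw_ok}, Python >= 3.11: {py_ok}")
```

Note that a Python interpreter running under Rosetta reports x86_64 even on an M-series Mac, so this check also catches that misconfiguration.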
MLX is currently supported only on Apple Silicon (M-series) systems.
Loading Qwen 3.5 Models with MLX
The 0.8B model is natively multimodal and highly efficient. You can run it in two primary modes:
For interactive chat:
mlx_lm.chat --model Qwen/Qwen3.5-0.8B
For server mode, use this to host an OpenAI-compatible API (at http://localhost:8080) for integration with custom web UIs:
mlx_lm.server --model Qwen/Qwen3.5-0.8B
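Since the server speaks the OpenAI chat-completions protocol, any standard client works against it. A minimal stdlib-only sketch (the endpoint path and default port 8080 are assumptions based on the server description above):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "Qwen/Qwen3.5-0.8B") -> bytes:
    """Build an OpenAI-style chat-completion payload as JSON bytes."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return json.dumps(payload).encode("utf-8")

def chat(prompt: str) -> dict:
    """POST the payload to the locally running mlx_lm.server."""
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Because the wire format matches OpenAI's, existing SDKs and web UIs can also be pointed at the local base URL without code changes.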
While MLX can load standard Hugging Face repositories, quantized models from the mlx-community organization are recommended for systems with 16 GB of RAM.
- Quantization: Using 4-bit or 8-bit variants (e.g., mlx-community/Qwen2.5-1.5B-Instruct-4bit) significantly reduces memory usage and improves generation speed.
- Unified Memory: Because the GPU and CPU share the same 16 GB pool, close high-memory applications (such as Chrome) when running larger models to prevent Metal "Out of Memory" errors.
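The memory savings from quantization follow from simple arithmetic: weight memory is roughly parameters × bits per weight ÷ 8 bytes. A quick sketch (weights only; the KV cache and activations need additional headroom on top of this):

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB: params * bits / 8 bytes."""
    return params_billion * 1e9 * bits / 8 / 1e9

# A 1.5B-parameter model as an example:
print(round(weight_gb(1.5, 16), 2))  # 3.0  (fp16)
print(round(weight_gb(1.5, 4), 2))   # 0.75 (4-bit)
```

So dropping from fp16 to 4-bit cuts the weight footprint by 4x, which is why quantized variants are the practical choice on a 16 GB machine.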
