Running large language models locally used to feel like something only research labs or hardcore ML engineers could do. Microsoft Foundry Local changes that, especially on Apple Silicon Macs, by making local AI practical, fast, and developer-friendly.
In this post, we’ll look at:
What Foundry Local actually is
How CPU and GPU execution works on macOS
What “GPU” really means on a Mac (and what CUDA is not)
How to run models locally
And how to build a simple chat app that talks to Foundry using an OpenAI-compatible API
No cloud. No API keys. Your machine, your data, your models.
What Is Foundry Local?
Foundry Local is a Microsoft tool that lets you run AI models directly on your machine while exposing them through an OpenAI-compatible REST API.
Think of it as a local model runtime that handles:
Model downloads and lifecycle
Hardware acceleration (CPU or Apple GPU)
A /v1/chat/completions endpoint that behaves like OpenAI’s API
Once a model is running, any app that can call an OpenAI-style API can talk to it—including your own Python scripts.
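For example, here is a minimal one-off request using nothing but the Python standard library. Treat it as a sketch rather than official sample code: it assumes a model is already running, that the service is listening on the port reported by foundry service status, and that you use the full model ID from foundry service list (both covered later in this post).

import json
import urllib.request

# Adjust to your local setup (see `foundry service status` / `foundry service list`).
BASE = "http://127.0.0.1:52999"
MODEL = "qwen2.5-0.5b-instruct-generic-gpu:4"

req = urllib.request.Request(
    f"{BASE}/v1/chat/completions",
    data=json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    }).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=120) as r:
    print(json.loads(r.read())["choices"][0]["message"]["content"])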
Key benefits:
Privacy: data never leaves your Mac
Offline capability: works without internet after download
Zero API cost: no per-token billing
Real performance: especially on Apple Silicon GPUs
Foundry Local is currently in public preview.
CPU vs GPU on macOS (Apple Silicon)
On Apple Silicon Macs (M1, M2, M3), Foundry can run models in two ways:
CPU Mode
Runs entirely on the CPU
Works on all Macs
Slower for inference
Required for some large models (for example, gpt-oss-20b)
GPU Mode (Recommended)
Uses Apple’s integrated GPU
Much faster token generation
Ideal for interactive chat
Supported by most modern models
You choose the device when starting a model:
Use GPU (best experience on Apple Silicon)
foundry model run qwen2.5-0.5b --device GPU
Force CPU
foundry model run qwen2.5-0.5b --device CPU
Let Foundry decide
foundry model run qwen2.5-0.5b --device Auto
If a model has no GPU variant, you’ll see an error like:
Exception: No model found for alias 'gpt-oss-20b' that can run on GPU
That simply means: CPU only.
What “GPU” Means on a Mac (and What CUDA Is Not)
This part often causes confusion, so let’s clarify.
GPU on macOS
Apple Silicon GPUs are integrated (not discrete NVIDIA cards)
Foundry uses Apple Metal, Apple’s native GPU framework
This allows efficient matrix operations required by LLMs
CUDA (Not Used on Mac)
CUDA is NVIDIA only
It does not exist on macOS
Any mention of CUDA applies to Linux/Windows systems with NVIDIA GPUs, not Macs
So on macOS:
✅ GPU acceleration = Apple Metal
❌ CUDA = not applicable
Foundry Local handles this automatically; you don’t need to think about Metal directly.
Installing Foundry Local on macOS
brew tap microsoft/foundrylocal
brew install foundrylocal
Verify:
foundry --version
Available Models on Mac
To see what’s available:
foundry model list
Small & Fast (GPU-enabled)
qwen2.5-0.5b – great for experimentation
qwen2.5-coder-0.5b – optimized for code
phi-3.5-mini
Medium (GPU-enabled)
phi-4-mini
mistral-7b-v0.2
qwen2.5-7b
deepseek-r1-7b
Large (GPU-enabled)
phi-4
qwen2.5-14b
deepseek-r1-14b
CPU-Only
gpt-oss-20b
For interactive chat, GPU models under ~7B parameters give the best experience.
Starting a Model
Load a model (first run downloads it):
foundry model run qwen2.5-0.5b --device GPU
Then check service status:
foundry service status
foundry service list
You’ll see a full model ID like:
qwen2.5-0.5b-instruct-generic-gpu:4
This full ID is required for API calls.
The short alias will not work.
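If you prefer to discover the full ID programmatically instead of reading it off foundry service list, the OpenAI-compatible surface should also let you list models. A small sketch, assuming the service exposes the standard GET /v1/models endpoint and is listening on the port shown by foundry service status:

import json
import urllib.request

BASE = "http://127.0.0.1:52999"  # replace with the port from `foundry service status`

# Print the model IDs the local service reports; use one of these in your API calls.
with urllib.request.urlopen(f"{BASE}/v1/models", timeout=30) as r:
    for model in json.loads(r.read()).get("data", []):
        print(model.get("id"))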
Building a Local Chat App (Python)
Below is a minimal chat app that talks directly to Foundry’s OpenAI-compatible API.
No OpenAI SDK. Just standard Python.
"""Chat with a local OpenAI-compatible API (e.g. Foundry).
On Apple Silicon: run `foundry model run qwen2.5-0.5b --device GPU`
and use the GPU variant ID from `foundry service list`.
Set REASONING_OFF=1 to try disabling reasoning output (if supported)."""
import json
import os
import re
import urllib.request
BASE = os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:52999").rstrip("/")
MODEL = os.environ.get(
"OPENAI_MODEL",
"qwen2.5-0.5b-instruct-generic-gpu:4"
)
REASONING_OFF = os.environ.get("REASONING_OFF", "").lower() in ("1", "true", "yes")
def _message_only(text: str) -> str:
"""Strip internal tokens and return only the final message."""
if "final<|message|>" in text:
return text.split("final<|message|>")[-1].strip()
return re.sub(r"<\|[^|]*\|>", "", text).strip()
def chat(messages: list[dict]) -> str:
body = {
"model": MODEL,
"messages": messages,
"max_tokens": 1024,
}
if REASONING_OFF:
body["reasoning"] = False
req = urllib.request.Request(
f"{BASE}/v1/chat/completions",
data=json.dumps(body).encode(),
headers={"Content-Type": "application/json"},
method="POST",
)
with urllib.request.urlopen(req, timeout=120) as r:
out = json.loads(r.read().decode())
raw = (out.get("choices") or [{}])[0].get("message", {}).get("content", "")
return _message_only(raw)
def main():
messages = []
print("Local Chat (type 'quit' to exit)\n")
while True:
try:
user = input("You: ").strip()
except (EOFError, KeyboardInterrupt):
break
if not user or user.lower() in ("quit", "exit", "q"):
break
messages.append({"role": "user", "content": user})
try:
reply = chat(messages)
messages.append({"role": "assistant", "content": reply})
print(f"Assistant: {reply}\n")
except Exception as e:
print(f"Error: {e}\n")
messages.pop()
if __name__ == "__main__":
main()
Key Design Choices
GPU first by default for Apple Silicon
Clean output by stripping internal reasoning tokens
Reasoning toggle using REASONING_OFF
Direct HTTP calls for transparency and simplicity
Conversation memory preserved via message history
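To see the token-stripping choice in action, here is a quick check against a made-up raw response. The control tokens below are illustrative only; the exact tokens (if any) depend on the model you run.

# Illustrative only: the raw string is hypothetical, not captured model output.
from local_chat import _message_only

raw = "<|channel|>analysis<|message|>internal scratchpad...<|end|>final<|message|>Hello! How can I help?"
print(_message_only(raw))  # -> Hello! How can I help?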
Running the App
Load a GPU model
foundry model run qwen2.5-0.5b --device GPU
Check model ID
foundry service list
Run chat
python local_chat.py
Switch models easily:
OPENAI_MODEL=phi-4-mini-generic-gpu:5 python local_chat.py
Performance Expectations (Mac Only)
GPU mode: ~50–100+ tokens/sec (feels instant)
CPU mode: ~10–20 tokens/sec
Large CPU models (20B): slower but usable
For daily interaction, GPU models are strongly recommended.
Why This Matters
Running models locally isn’t just a novelty.
It enables:
Private AI workflows
Offline productivity
Unlimited experimentation
Better understanding of LLM internals
Real control over cost and latency
Foundry Local makes local AI feel like a first-class developer experience, especially on macOS.
Quick Start Summary
brew tap microsoft/foundrylocal
brew install foundrylocal
foundry model run qwen2.5-0.5b --device GPU
foundry service list
python local_chat.py (save the code above as local_chat.py, then run it)
You now have a local AI assistant running entirely on your Mac.
No cloud. No CUDA. No tokens burned.
Ways to Run LLMs Locally on a Mac
There isn’t just one way to run models locally. Each option has a different philosophy and tradeoff.
1. Ollama – The Simplest Way to Get Started
Ollama is currently the most popular way to run LLMs locally.
It focuses on:
Extremely simple setup
One line model downloads
A friendly CLI experience
Example:
ollama run llama3
That’s it: you’re chatting.
Strengths
Very easy to install and use
Great model catalog
Perfect for quick experiments
Works well on Apple Silicon using Metal
Limitations
Opinionated runtime
Limited control over model internals
API is not fully OpenAI-compatible
Less visibility into model variants (CPU vs GPU, quantization details)
Ollama is excellent when you want speed of setup over control.
2. llama.cpp / LM Studio – Low-Level Control
Tools like llama.cpp (and GUIs built on top of it, such as LM Studio) focus on:
Maximum efficiency
Fine-grained control over quantization
Running on very constrained hardware
Strengths
Extremely efficient
Deep control over memory and performance
Strong community support
Limitations
Steeper learning curve
Less “application-ready”
Not designed around an OpenAI-style API
These tools are ideal if you care deeply about inference mechanics, not application integration.
3. Microsoft Foundry Local – Local AI for Application Builders
Foundry Local takes a different approach.
Instead of focusing only on chatting with models, it focuses on building applications with local models.
Its core idea is simple:
Run models locally, but expose them through a standard OpenAI-compatible REST API.
This means:
Your existing OpenAI-based apps can switch to local models (see the SDK sketch below)
You don’t rewrite your client code
You get CPU/GPU optimization handled for you
Foundry feels less like a toy and more like local infrastructure.
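For instance, if your client code already uses the official OpenAI Python SDK, pointing it at Foundry should in principle only require swapping the base URL and the model name. A minimal sketch, assuming the openai package is installed and the qwen2.5 GPU variant from earlier is running; the API key is just a non-empty placeholder, since no cloud account is involved:

from openai import OpenAI

# Same client code you would use against the cloud API; only base_url and model change.
client = OpenAI(
    base_url="http://127.0.0.1:52999/v1",  # port from `foundry service status`
    api_key="not-needed-locally",          # placeholder; there is no cloud auth to satisfy
)

resp = client.chat.completions.create(
    model="qwen2.5-0.5b-instruct-generic-gpu:4",
    messages=[{"role": "user", "content": "Summarize why local inference is useful."}],
)
print(resp.choices[0].message.content)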
Why Foundry Local Is Different
Foundry Local sits at the intersection of:
Local inference
Enterprise-friendly APIs
Application architecture
Key characteristics:
OpenAI-compatible endpoint (/v1/chat/completions)
Explicit CPU vs GPU model variants
First-class Apple Silicon GPU support (via Metal)
Clear separation between model runtime and application code
Works well with Python, Node, REST tools, and agents
If Ollama is “run a model and chat”,
Foundry Local is “run a model and build systems.”
Just compute.
Thanks
Sreeni Ramadorai


