Running AI Models Locally on Your Mac with Microsoft Foundry Local

Running large language models locally used to feel like something only research labs or hardcore ML engineers could do. Microsoft Foundry Local changes that, especially on Apple Silicon Macs, by making local AI practical, fast, and developer-friendly.

In this post, we’ll look at:

What Foundry Local actually is
How CPU and GPU execution works on macOS
What “GPU” really means on a Mac (and why CUDA doesn’t apply)

How to run models locally

And how to build a simple chat app that talks to Foundry using an OpenAI-compatible API

No cloud. No API keys. Your machine, your data, your models.

What Is Foundry Local?

Foundry Local is a Microsoft tool that lets you run AI models directly on your machine while exposing them through an OpenAI compatible REST API.

Think of it as a local model runtime that handles:

Model downloads and lifecycle

Hardware acceleration (CPU or Apple GPU)

A /v1/chat/completions endpoint that behaves like OpenAI’s API

Once a model is running, any app that can call an OpenAI-style API can talk to it—including your own Python scripts.
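
Here is a minimal sketch of that idea using the official openai Python package (pip install openai) pointed at the local server. The port, model ID, and API key below are placeholders, not values Foundry guarantees: use the address from foundry service status, the full model ID from foundry service list, and any dummy key (a local server like this typically doesn't validate it).

from openai import OpenAI

# Placeholder values: replace the port with the one `foundry service status`
# reports and the model ID with the one `foundry service list` shows.
client = OpenAI(
    base_url="http://127.0.0.1:52999/v1",  # local Foundry endpoint plus /v1
    api_key="not-needed",                  # required by the SDK; assumed unused locally
)

resp = client.chat.completions.create(
    model="qwen2.5-0.5b-instruct-generic-gpu:4",
    messages=[{"role": "user", "content": "Say hello from my Mac."}],
)
print(resp.choices[0].message.content)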

Key benefits:

Privacy: data never leaves your Mac

Offline capability: works without internet after download

Zero API cost: no per-token billing

Real performance: especially on Apple Silicon GPUs

Foundry Local is currently in public preview.

CPU vs GPU on macOS (Apple Silicon)

On Apple Silicon Macs (M1, M2, M3), Foundry can run models in two ways:

CPU Mode

Runs entirely on the CPU

Works on all Macs

Slower for inference

Required for some large models (for example, gpt-oss-20b)

GPU Mode (Recommended)

Uses Apple’s integrated GPU

Much faster token generation

Ideal for interactive chat

Supported by most modern models

You choose the device when starting a model:

Use GPU (best experience on Apple Silicon)

foundry model run qwen2.5-0.5b --device GPU

Force CPU

foundry model run qwen2.5-0.5b --device CPU

Let Foundry decide

foundry model run qwen2.5-0.5b --device Auto

If a model has no GPU variant, you’ll see an error like:

Exception: No model found for alias 'gpt-oss-20b' that can run on GPU

That simply means: CPU only.

What “GPU” Means on a Mac (and What CUDA Is Not)

This part often causes confusion, so let’s clarify.

GPU on macOS

Apple Silicon GPUs are integrated (not discrete NVIDIA cards)

Foundry uses Apple Metal, Apple’s native GPU framework

This allows efficient matrix operations required by LLMs

CUDA (Not Used on Mac)

CUDA is NVIDIA only

It does not exist on macOS

Any mention of CUDA applies to Linux/Windows systems with NVIDIA GPUs, not Macs

So on macOS:

✅ GPU acceleration = Apple Metal

❌ CUDA = not applicable

Foundry Local handles this automatically; you don’t need to think about Metal directly.

Installing Foundry Local on macOS

brew tap microsoft/foundrylocal
brew install foundrylocal

Verify:

foundry --version

Available Models on Mac

To see what’s available:

foundry model list

Small & Fast (GPU-enabled)

qwen2.5-0.5b – great for experimentation

qwen2.5-coder-0.5b – optimized for code

phi-3.5-mini

Medium (GPU-enabled)

phi-4-mini

mistral-7b-v0.2

qwen2.5-7b

deepseek-r1-7b

Large (GPU-enabled)

phi-4

qwen2.5-14b

deepseek-r1-14b

CPU-Only

gpt-oss-20b

For interactive chat, GPU models under ~7B parameters give the best experience.

Starting a Model

Load a model (first run downloads it):

foundry model run qwen2.5-0.5b --device GPU

Then check service status:

foundry service status
foundry service list

You’ll see a full model ID like:

qwen2.5-0.5b-instruct-generic-gpu:4

This full ID is required for API calls.
The short alias will not work.
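
If you prefer to discover that ID from code rather than from the CLI, one hedged option is the standard OpenAI-style model listing route. Whether Foundry Local exposes GET /v1/models is an assumption here, not something the CLI output above confirms; if the request fails, stick with foundry service list.

# Hypothetical sketch: list model IDs via an OpenAI-style /v1/models endpoint.
# This assumes Foundry Local serves that route; fall back to
# `foundry service list` if it doesn't.
import json
import urllib.request

BASE = "http://127.0.0.1:52999"  # replace with the address from `foundry service status`

with urllib.request.urlopen(f"{BASE}/v1/models", timeout=30) as r:
    data = json.loads(r.read().decode())

for m in data.get("data", []):
    print(m.get("id"))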

Building a Local Chat App (Python)

Below is a minimal chat app that talks directly to Foundry’s OpenAI-compatible API.
No OpenAI SDK. Just the Python standard library.

"""Chat with a local OpenAI-compatible API (e.g. Foundry).
On Apple Silicon: run `foundry model run qwen2.5-0.5b --device GPU`
and use the GPU variant ID from `foundry service list`.
Set REASONING_OFF=1 to try disabling reasoning output (if supported)."""

import json
import os
import re
import urllib.request

# If Foundry is listening on a different address on your machine, set
# OPENAI_BASE_URL to the endpoint shown by `foundry service status`.
BASE = os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:52999").rstrip("/")
MODEL = os.environ.get(
    "OPENAI_MODEL",
    "qwen2.5-0.5b-instruct-generic-gpu:4"
)
REASONING_OFF = os.environ.get("REASONING_OFF", "").lower() in ("1", "true", "yes")


def _message_only(text: str) -> str:
    """Strip internal tokens and return only the final message."""
    if "final<|message|>" in text:
        return text.split("final<|message|>")[-1].strip()
    return re.sub(r"<\|[^|]*\|>", "", text).strip()


def chat(messages: list[dict]) -> str:
    body = {
        "model": MODEL,
        "messages": messages,
        "max_tokens": 1024,
    }
    if REASONING_OFF:
        body["reasoning"] = False

    req = urllib.request.Request(
        f"{BASE}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

    with urllib.request.urlopen(req, timeout=120) as r:
        out = json.loads(r.read().decode())

    raw = (out.get("choices") or [{}])[0].get("message", {}).get("content", "")
    return _message_only(raw)


def main():
    messages = []
    print("Local Chat (type 'quit' to exit)\n")

    while True:
        try:
            user = input("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            break

        if not user or user.lower() in ("quit", "exit", "q"):
            break

        messages.append({"role": "user", "content": user})
        try:
            reply = chat(messages)
            messages.append({"role": "assistant", "content": reply})
            print(f"Assistant: {reply}\n")
        except Exception as e:
            print(f"Error: {e}\n")
            messages.pop()


if __name__ == "__main__":
    main()

Key Design Choices

GPU first by default for Apple Silicon

Clean output by stripping internal reasoning tokens

Reasoning toggle using REASONING_OFF

Direct HTTP calls for transparency and simplicity

Conversation memory preserved via message history
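
One thing the script does not do is cap that history, so a long session will eventually exceed the model's context window. Below is a minimal sketch of one way to handle it; MAX_TURNS is an illustrative name and value of my own, not anything Foundry requires, and you would tune it per model.

# Hypothetical helper: keep only the most recent exchanges so the prompt
# stays inside the model's context window.
MAX_TURNS = 8  # user/assistant pairs to keep (illustrative value)

def trim_history(messages: list[dict]) -> list[dict]:
    """Drop the oldest messages once the conversation grows too long."""
    keep = MAX_TURNS * 2  # each turn adds one user and one assistant message
    return messages[-keep:]

# In main(), call it before each request:
#     reply = chat(trim_history(messages))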

Running the App

Load a GPU model

foundry model run qwen2.5-0.5b --device GPU

Check model ID

foundry service list

Run chat

python local_chat.py

Switch models easily:

OPENAI_MODEL=phi-4-mini-generic-gpu:5 python local_chat.py

Performance Expectations (Mac Only)

GPU mode: ~50–100+ tokens/sec (feels instant)

CPU mode: ~10–20 tokens/sec

Large CPU models (20B): slower but usable

For daily interaction, GPU models are strongly recommended.
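
If you want to verify these numbers on your own hardware rather than take the ranges at face value, a rough sketch is to time one request and divide by the completion token count. This assumes the response includes an OpenAI-style usage block; if Foundry omits it, the script reports elapsed time only.

# Rough throughput check: time one request, then divide by completion tokens.
# Assumes an OpenAI-style "usage" object in the response (an assumption here).
import json
import time
import urllib.request

BASE = "http://127.0.0.1:52999"                # address from `foundry service status`
MODEL = "qwen2.5-0.5b-instruct-generic-gpu:4"  # full ID from `foundry service list`

body = json.dumps({
    "model": MODEL,
    "messages": [{"role": "user", "content": "Write a short paragraph about Apple Metal."}],
    "max_tokens": 256,
}).encode()

req = urllib.request.Request(
    f"{BASE}/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

start = time.time()
with urllib.request.urlopen(req, timeout=300) as r:
    out = json.loads(r.read().decode())
elapsed = time.time() - start

tokens = (out.get("usage") or {}).get("completion_tokens")
if tokens:
    print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/sec")
else:
    print(f"Request took {elapsed:.1f}s (no usage data returned)")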

Why This Matters

Running models locally isn’t just a novelty.

It enables:

Private AI workflows

Offline productivity

Unlimited experimentation

Better understanding of LLM internals

Real control over cost and latency

Foundry Local makes local AI feel like a first-class developer experience, especially on macOS.

Quick Start Summary

brew tap microsoft/foundrylocal
brew install foundrylocal

foundry model run qwen2.5-0.5b --device GPU
foundry service list

python local_chat.py (save the code above as local_chat.py, then run it)

You now have a local AI assistant running entirely on your Mac.

No cloud. No CUDA. No tokens burned.

Ways to Run LLMs Locally on a Mac

There isn’t just one way to run models locally. Each option has a different philosophy and tradeoff.

1. Ollama – The Simplest Way to Get Started

Ollama is currently the most popular way to run LLMs locally.

It focuses on:

Extremely simple setup

One-line model downloads

A friendly CLI experience

Example:

ollama run llama3

That’s it; you’re chatting.

Strengths

Very easy to install and use

Great model catalog

Perfect for quick experiments

Works well on Apple Silicon using Metal

Limitations

Opinionated runtime

Limited control over model internals

API is not fully OpenAI-compatible

Less visibility into model variants (CPU vs GPU, quantization details)

Ollama is excellent when you want speed of setup over control.

2. llama.cpp / LM Studio – Low-Level Control

Tools like llama.cpp (and GUIs built on top of it, such as LM Studio) focus on:

Maximum efficiency

Fine-grained control over quantization

Running on very constrained hardware

Strengths

Extremely efficient

Deep control over memory and performance

Strong community support

Limitations

Steeper learning curve

Less “application-ready”

Not designed around an OpenAI-style API

These tools are ideal if you care deeply about inference mechanics, not application integration.

3. Microsoft Foundry Local – Local AI for Application Builders

Foundry Local takes a different approach.

Instead of focusing only on chatting with models, it focuses on building applications with local models.

Its core idea is simple:

Run models locally, but expose them through a standard OpenAI-compatible REST API.

This means:

Your existing OpenAI-based apps can switch to local models

You don’t rewrite your client code

You get CPU/GPU optimization handled for you

Foundry feels less like a toy and more like local infrastructure.

Why Foundry Local Is Different

Foundry Local sits at the intersection of:

Local inference

Enterprise-friendly APIs

Application architecture

Key characteristics:

OpenAI-compatible endpoint (/v1/chat/completions)

Explicit CPU vs GPU model variants

First-class Apple Silicon GPU support (via Metal)

Clear separation between model runtime and application code

Works well with Python, Node, REST tools, and agents

If Ollama is “run a model and chat”,
Foundry Local is “run a model and build systems.”

Just compute.
Thanks
Sreeni Ramadorai
