DEV Community

Roger Rajaratnam

Posted on • Originally published at sourcier.uk

Running Local AI Models on macOS

Original post: Running Local AI Models on macOS

I use GitHub Copilot at work and Claude for personal projects. Both switched to usage-based billing this month, dropping the flat subscription model. For anyone using these tools heavily across multiple projects, that shift makes the monthly cost unpredictable. Running models locally removes that variable entirely: no usage bills, no rate limits, and everything stays on your machine.

The quality gap has closed enough that local models are a realistic daily driver now, not just an experiment.

This guide covers the first-time setup on a Mac with Apple Silicon. I run this on an M1 MacBook Pro with 16 GB of unified memory. The default settings are tuned for that hardware, but each relevant section also covers what to change if you have more RAM.

How it fits together

Diagram not rendered here. View the canonical article for the full version: https://sourcier.uk/blog/local-ai-ollama-setup

Prerequisites

  • macOS with Apple Silicon (M series)
  • Homebrew installed
  • A few GB of free disk space per model (most 7–8B models need 4–5 GB each)
  • VS Code, for the integration sections at the end; a GitHub Copilot subscription is needed to use cloud models, but the local Ollama integration works without one

Install Ollama

Ollama is the runtime that downloads, manages, and serves local models. Install it via Homebrew:

brew install --cask ollama

Alternatively, download the installer directly from ollama.com. Once launched, Ollama places an icon in the menu bar and starts the API server at http://localhost:11434. Confirm it is running:

curl http://localhost:11434

The response should be Ollama is running.

Memory and performance settings

Running a language model is fundamentally a memory operation, not a compute one. A model's weights are the billions of numerical parameters that encode its behaviour, and they must be loaded entirely into RAM before a single token can be generated. A 7B model in Q4_K_M quantisation takes around 4–5 GB; an 8B model is similar. If those weights do not fit and the system starts paging to disk, inference slows to a near halt regardless of how fast your CPU is.
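The size figures above follow from simple arithmetic: parameter count times bits per weight, divided by eight. A minimal sketch, assuming an average of roughly 4.8 bits per weight for Q4_K_M-style quantisation (an approximation, not a figure from this article); the helper name estimate_weights_gb is hypothetical.

```shell
# Rough size of a model's weights in GB.
# $1 = parameters in billions, $2 = average bits per weight (assumed value).
estimate_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 7 4.8    # 4-bit style quantisation
estimate_weights_gb 7 16     # unquantised f16, for comparison
```

The estimate covers weights only. The context window adds KV cache memory on top, which is what the settings below keep in check.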

On Apple Silicon this matters more than on a typical machine: the CPU, Metal GPU, and every running application share a single pool of unified memory. VS Code, a dev server, a browser, and Ollama are all drawing from the same 16 GB.

Ollama's defaults are generous with memory, which compounds these pressures. Without tuning, the runtime may load multiple models simultaneously, allocate a context window far larger than needed, and leave your other tools fighting for RAM.

Add these variables to ~/.zshrc or ~/.zprofile:

# Limit concurrency — one model at a time on 16 GB
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1

# Keep the model warm between requests — avoids cold-start latency in VS Code
export OLLAMA_KEEP_ALIVE=30m

# Default context window
export OLLAMA_CONTEXT_LENGTH=4096

# Apple Silicon optimisations — the highest-impact pair for 16 GB
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0

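The Ollama menu-bar app does not read ~/.zshrc, and a server that is already running keeps its old environment. After editing the file, either quit the app and run ollama serve from a shell that has the variables, or set each one with launchctl setenv and relaunch the app. To load the variables into the current shell without restarting it: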

source ~/.zshrc

| Setting | Why |
| --- | --- |
| OLLAMA_MAX_LOADED_MODELS=1 | Prevents multiple models competing for the same 16 GB |
| OLLAMA_NUM_PARALLEL=1 | Explicit default; one request at a time per model, so context memory is not multiplied by concurrent requests |
| OLLAMA_FLASH_ATTENTION=1 | Reduces peak activation memory on M1 Metal |
| OLLAMA_KV_CACHE_TYPE=q8_0 | Halves KV cache RAM compared to the default f16 |
| OLLAMA_KEEP_ALIVE=30m | Model stays loaded between requests, no cold-start delay |

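OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 together free around 1–2 GB of effective headroom. The quantised KV cache only takes effect when flash attention is enabled, so set them as a pair. That headroom is enough to run 8B parameter models comfortably on 16 GB when they would otherwise be marginal.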
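If Ollama runs as the menu-bar app rather than from a terminal, exports in ~/.zshrc do not reach it; Ollama's FAQ describes launchctl setenv for that case. A minimal sketch, assuming the exports are written one per line in the form shown earlier. The helper name ollama_exports_to_launchctl is hypothetical, and it only prints commands, so nothing changes until you decide to run them.

```shell
# Print a `launchctl setenv` command for every `export OLLAMA_*=value`
# line read from stdin. Other lines are ignored. Nothing is executed.
ollama_exports_to_launchctl() {
  sed -n 's/^export \(OLLAMA_[A-Z_]*\)=\(.*\)$/launchctl setenv \1 \2/p'
}

# Demo on inline input rather than a real profile file.
printf 'export OLLAMA_KEEP_ALIVE=30m\n# a comment\nexport PATH=/x\n' \
  | ollama_exports_to_launchctl
```

To apply the result on a Mac, review the printed commands, run ollama_exports_to_launchctl < ~/.zshrc | sh, then quit and reopen the Ollama app so it starts with the new environment.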

Adjusting for more RAM

The settings above are conservative, tuned for 16 GB. On machines with more unified memory you can relax the concurrency limits and drop the KV cache compression:

| Setting | 16 GB (M1/M2) | 32 GB (M2 Pro/M3 Pro) | 64 GB+ (M3 Max/Ultra) |
| --- | --- | --- | --- |
| OLLAMA_MAX_LOADED_MODELS | 1 | 2 | 3 or more |
| OLLAMA_NUM_PARALLEL | 1 | 2 | 4 |
| OLLAMA_KV_CACHE_TYPE | q8_0 | q8_0 or omit | Omit (use default f16) |

OLLAMA_FLASH_ATTENTION=1 is still worth keeping on any Apple Silicon machine: it reduces peak activation memory regardless of total RAM.
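The tiers above can be turned into a small helper that prints the matching export lines for a given amount of memory. This is only a sketch: the function name ollama_tier_settings is hypothetical, treating 32 and 64 as hard cut-offs is an assumption, and for the 32 GB tier it picks q8_0 where the table allows either choice.

```shell
# Print export lines for a machine with $1 GB of unified memory.
# Values follow the tier table; the exact cut-offs are assumptions.
ollama_tier_settings() {
  ram_gb="$1"
  if [ "$ram_gb" -ge 64 ]; then
    loaded=3; parallel=4; kv=""        # omit: default f16 KV cache
  elif [ "$ram_gb" -ge 32 ]; then
    loaded=2; parallel=2; kv="q8_0"
  else
    loaded=1; parallel=1; kv="q8_0"
  fi
  echo "export OLLAMA_MAX_LOADED_MODELS=$loaded"
  echo "export OLLAMA_NUM_PARALLEL=$parallel"
  echo "export OLLAMA_FLASH_ATTENTION=1"
  if [ -n "$kv" ]; then
    echo "export OLLAMA_KV_CACHE_TYPE=$kv"
  fi
}

ollama_tier_settings 16
```

On macOS, sysctl -n hw.memsize reports installed memory in bytes, so ollama_tier_settings $(( $(sysctl -n hw.memsize) / 1073741824 )) selects the tier for the current machine.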

Staying fully local

Ollama does not send your prompts anywhere by default. If you are working with sensitive data and want a hard guarantee, add this flag too:

# Optional — disables remote inference and web search entirely
export OLLAMA_NO_CLOUD=1

Choosing a model

Every model has a name and a size tag. The number in the tag reflects how many billion parameters it contains, which determines both output quality and how much RAM it needs to load. Use the table below to pick the right fit for your hardware and use case.

Model reference

| Model | Size | Vision | Best for |
| --- | --- | --- | --- |
| gemma3:4b | ~2.5 GB | Yes | Fast chat, vision, light tasks |
| qwen3:8b | ~4.5 GB | No | Best all-rounder, strong reasoning |
| qwen2.5vl:7b | ~4.5 GB | Yes | Vision and text |
| qwen2.5vl:3b | ~2 GB | Yes | Lightweight vision |
| qwen2.5-coder:7b | ~4.3 GB | No | Code generation |
| mistral-nemo | ~7 GB | No | Long documents, 32K context |
| gemma3:12b | ~8 GB | Yes | Higher quality, viable with flash attention |
| nomic-embed-text | ~0.3 GB | No | Embeddings and RAG pipelines |

On 16 GB, avoid 13B models and larger. They will page to swap and feel sluggish under any real workload. On 32 GB you can run 13B and 14B models comfortably, and gemma3:12b and qwen3:14b become reliable daily drivers. On 64 GB or more, 27B and 32B models are viable. Check ollama.com/library for the full catalogue.
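To shortlist models against a memory budget, the sizes from the model reference table can be filtered with a few lines of shell. A rough sketch: the sizes are the approximate figures from the table, the helper name models_within is hypothetical, and how much of your RAM to budget for weights is left to you.

```shell
# Approximate sizes in GB, copied from the model reference table.
MODEL_SIZES='gemma3:4b 2.5
qwen3:8b 4.5
qwen2.5vl:7b 4.5
qwen2.5vl:3b 2
qwen2.5-coder:7b 4.3
mistral-nemo 7
gemma3:12b 8
nomic-embed-text 0.3'

# Print the models whose weights fit within a budget of $1 GB.
models_within() {
  printf '%s\n' "$MODEL_SIZES" \
    | awk -v max="$1" '$2 + 0 <= max + 0 { print $1 }'
}

models_within 4.5
```

The filter looks at weight size only. Context length and whatever else is running on the machine still need their own share of memory.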

Prefer Q4_K_M quantised variants when available. They offer the best speed-to-quality tradeoff regardless of hardware tier.

When you want to attach an image to a conversation, switch to a vision model like qwen2.5vl:7b or gemma3:4b. Text-only models reject image input.

Pulling and running a model

Use ollama pull to download a model and ollama run to test it interactively:

ollama pull qwen3:8b
ollama run qwen3:8b

The first pull downloads several gigabytes, so run this on a decent connection. After that, the model lives on disk at ~/.ollama/models/ and later runs start without another download; only the load into memory remains.

A good starting set for most daily-use scenarios:

ollama pull qwen3:8b           # daily driver — best all-rounder
ollama pull qwen2.5-coder:7b   # coding tasks
ollama pull qwen2.5vl:7b       # vision and text
ollama pull gemma3:4b          # lightweight vision alternative
ollama pull nomic-embed-text   # embeddings and RAG
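When repeating this setup on another machine, it can help to pull only what is missing. A minimal sketch, assuming ollama list prints a header row followed by one model per line with the name in the first column; the helper name missing_models is hypothetical, and the demo input is a made-up sample rather than real output.

```shell
# Read `ollama list` output on stdin and print each wanted model
# (given as arguments) that is not installed yet.
missing_models() {
  installed="$(awk 'NR > 1 { print $1 }')"
  for model in "$@"; do
    # Untagged names are listed with an implicit :latest tag.
    case "$model" in
      *:*) wanted="$model" ;;
      *)   wanted="$model:latest" ;;
    esac
    printf '%s\n' "$installed" | grep -qxF "$wanted" || echo "$model"
  done
}

# Made-up sample standing in for `ollama list` output.
printf 'NAME ID SIZE MODIFIED\nqwen3:8b abc123 5.2 GB 2 days ago\n' \
  | missing_models qwen3:8b gemma3:4b nomic-embed-text
```

Against a real install, ollama list | missing_models qwen3:8b qwen2.5-coder:7b qwen2.5vl:7b gemma3:4b nomic-embed-text | xargs -n 1 ollama pull fetches only the gaps.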

VS Code Copilot integration

VS Code Copilot can use a local Ollama server as a model provider. The setup is straightforward, but there is one catch: Copilot reads each model's maximum reported context size and may allocate the full window upfront.

| Model | Reported max context | KV cache cost at max |
| --- | --- | --- |
| qwen3:8b | 41K tokens | ~4 GB |
| qwen2.5vl:7b | 128K tokens | ~16 GB |
| qwen2.5-coder:7b | 33K tokens | ~3 GB |

OLLAMA_CONTEXT_LENGTH=4096 sets a global default, but Copilot does not always respect it in API requests. The reliable fix is a Modelfile: a small config file that bakes a capped context window into a named model variant.
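KV cache cost grows linearly with context length, which is why capping the window matters so much. A rough sketch of the usual estimate: two tensors (keys and values) per layer, each sized KV heads × head dimension × context tokens × bytes per element. The helper name kv_cache_mib and the model dimensions below are illustrative assumptions, not measurements of any model in the table, so the results will not match the table's figures exactly.

```shell
# Estimated KV cache size in MiB.
# $1 = layers, $2 = KV heads, $3 = head dimension,
# $4 = context tokens, $5 = bytes per element (2 for f16, about 1 for q8_0).
kv_cache_mib() {
  awk -v layers="$1" -v kv_heads="$2" -v head_dim="$3" \
      -v ctx="$4" -v bytes="$5" \
      'BEGIN { printf "%d\n", 2 * layers * kv_heads * head_dim * ctx * bytes / 1048576 }'
}

# Assumed dimensions: 36 layers, 8 KV heads, head dimension 128.
kv_cache_mib 36 8 128 4096 2     # f16 at a 4K context
kv_cache_mib 36 8 128 4096 1     # q8_0 at a 4K context
kv_cache_mib 36 8 128 32768 2    # f16 at a 32K context
```

With these assumed dimensions, going from 4K to 32K multiplies the cache by eight, while switching from f16 to q8_0 roughly halves it.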

Create capped model variants

A Modelfile is a plain text file that tells Ollama how to build a named variant from an existing base. The two fields that matter here are FROM (the base model to derive from) and PARAMETER num_ctx (the context window to enforce):

FROM qwen3:8b
PARAMETER num_ctx 4096
PARAMETER temperature 0.7

temperature controls how much variation the model introduces when generating a response. 0.7 is a reasonable general-purpose default: creative enough to avoid repetitive output, focused enough to stay on topic. The coder variant uses 0.2 because code generation benefits from deterministic output. There is usually one right answer, not several equally valid variations.

Running ollama create <name> -f <Modelfile> registers that file as a new named model. No additional data is downloaded: Ollama references the base model already on disk with the specified parameters baked in.

The following creates all four variants in one pass:

mkdir -p ~/ollama-models

printf 'FROM qwen3:8b\nPARAMETER num_ctx 4096\nPARAMETER temperature 0.7\n' \
  > ~/ollama-models/Modelfile.qwen3-fast

printf 'FROM qwen2.5-coder:7b\nPARAMETER num_ctx 4096\nPARAMETER temperature 0.2\n' \
  > ~/ollama-models/Modelfile.coder-fast

printf 'FROM qwen2.5vl:7b\nPARAMETER num_ctx 4096\n' \
  > ~/ollama-models/Modelfile.vision-fast

printf 'FROM gemma3:4b\nPARAMETER num_ctx 4096\n' \
  > ~/ollama-models/Modelfile.gemma-fast

ollama create qwen3-fast  -f ~/ollama-models/Modelfile.qwen3-fast
ollama create coder-fast  -f ~/ollama-models/Modelfile.coder-fast
ollama create vision-fast -f ~/ollama-models/Modelfile.vision-fast
ollama create gemma-fast  -f ~/ollama-models/Modelfile.gemma-fast
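If you expect to add more variants later, the printf lines above can be wrapped in a small generator. This is only a sketch: the function name render_modelfile is hypothetical, and it emits just the FROM and PARAMETER lines used in this guide.

```shell
# Print a Modelfile for base model $1 with context window $2
# and an optional temperature $3.
render_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2"
  if [ -n "${3:-}" ]; then
    printf 'PARAMETER temperature %s\n' "$3"
  fi
}

render_modelfile qwen3:8b 4096 0.7
```

Redirect the output to a file and register it as before, for example render_modelfile gemma3:4b 8192 > ~/ollama-models/Modelfile.gemma-8k followed by ollama create gemma-8k -f ~/ollama-models/Modelfile.gemma-8k (gemma-8k is a made-up variant name).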

Connect to Ollama and select a model

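To wire VS Code Copilot to a local Ollama server, add this to your VS Code settings.json. If VS Code flags the key as unknown, search the Settings UI for "ollama" to find the name your Copilot Chat version uses: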

"github.copilot.chat.ollama.endpoint": "http://localhost:11434"

You can also do this through the UI: open Copilot Chat (Cmd+Shift+I on macOS), click the model picker dropdown at the top of the chat panel, and choose "Manage Models". Once Ollama is running, VS Code automatically discovers every model installed on the server at localhost:11434, not just the ones currently loaded.

Once connected, the capped variants appear in the picker alongside any cloud models. Switch to qwen3-fast, coder-fast, vision-fast, or gemma-fast depending on the task. After starting a conversation, confirm the model loaded:

ollama ps

Copilot CLI

GitHub Copilot has a standalone CLI for the terminal. Install it via Homebrew:

brew install copilot-cli

Once installed, run copilot from any project directory. On first launch it asks you to trust the folder and log in to GitHub. You type prompts directly in the terminal and Copilot can read, modify, and run files in the current directory. It supports plan mode (Shift+Tab to toggle), custom agents, and MCP servers.

You can point it at Ollama to use local models rather than GitHub's cloud. The quickest way is:

ollama launch copilot

This opens a model selector populated from Ollama's library. To specify a model directly:

ollama launch copilot --model qwen3:8b

For manual wiring, set the Ollama endpoint via environment variables before running copilot:

export COPILOT_PROVIDER_BASE_URL=http://localhost:11434/v1
export COPILOT_PROVIDER_API_KEY=
export COPILOT_PROVIDER_WIRE_API=responses
export COPILOT_MODEL=qwen3:8b

One caveat: Copilot CLI works best with a generous context window. The Ollama docs recommend at least 64K tokens, so the 4K capped variants created above are too small for it. Use the base models directly and raise OLLAMA_CONTEXT_LENGTH when running Copilot CLI sessions: 65536 matches the recommendation if memory allows, and 32768 is a workable floor on 16 GB.

Wrapping up

This covers the full stack: Ollama installed and tuned, a model set selected for different use cases, VS Code Copilot wired to local variants, and the standalone Copilot CLI pointed at Ollama. On 16 GB the memory settings and capped context variants make local inference genuinely practical for everyday coding and chat work, not just a curiosity.

A local model handles tasks that fit in a 4K context window without touching any external service. For longer context, heavier reasoning, or the times a cloud model simply performs better, the paid providers are still there. The difference is that reaching for them is now a deliberate choice rather than the default.

Keeping up with new model releases is a single ollama pull command. Ollama fetches only changed layers, so updates stay fast even at multi-GB model sizes.

I'm also working on a dedicated machine for local AI inference: custom hardware that removes the unified memory constraint entirely. I'll write that up once it's running. If you're building something similar or have a setup you're happy with, drop a comment below or subscribe via the form at the end of this page to catch that post when it lands.
