galian for Cursuri AI

Run LLMs Locally with Ollama in 2026: The Practical Developer Guide

For years, "run the model locally" was the option you mentioned and then didn't take: the models were too weak, the tooling too fiddly, and the cloud APIs too convenient. In 2026 that calculus has genuinely shifted. Open-weight models in the 12–35B range now handle real coding and agent workloads, Apple Silicon got a dedicated inference engine, and Ollama quietly became a drop-in backend for the same tools you already use against cloud APIs — including Claude Code.

I teach practical AI engineering at Cursuri-AI.ro, Eastern Europe's AI education platform, and local inference has gone from a curiosity module to one of the questions companies ask us most — usually spelled "how do we use LLMs without sending our data anywhere?" This guide is the answer I give developers: what changed, what hardware you actually need, which models are worth pulling, and how to plug it all into a real workflow.

As always with this space: versions and model rankings move monthly. Everything below is verified against Ollama's official blog and docs as of early July 2026 — re-check before you build a budget or an architecture on it.

Why local, and why now

Three arguments have survived contact with production; the rest is mostly vibes.

Privacy and data residency. With a local model, prompts and outputs never leave your machine (or your VPC, if you self-host on a server). For anyone dealing with client data, medical text, legal documents, or GDPR-sensitive workloads, this eliminates the entire "what does the provider do with my data" conversation instead of managing it through contracts. This is the single biggest adoption driver we see in Europe, and it's the backbone of our course on local LLMs, self-hosting, and privacy.

Cost shape. Cloud APIs bill per token; local inference bills you mostly up front, in hardware you may already own, plus electricity. For high-volume, latency-tolerant workloads (batch classification, summarization pipelines, internal tooling), a mid-range GPU that's already on someone's desk can absorb work that would otherwise be a real monthly line item. (For low-volume or frontier-quality work, cloud still wins. More on that below.)

No external dependency. A local model doesn't get deprecated, rate-limited, price-changed, or suspended out from under you. After the model-availability surprises of the last year, "at least one workload runs on weights we control" has become a reasonable line item in a resilience plan, not paranoia.

What actually changed in Ollama in 2026

If you last touched Ollama when it was "a nice wrapper around llama.cpp," the 2026 releases are the reason to look again. All of this is from Ollama's official blog:

  • Anthropic API compatibility (January, v0.14.0). Ollama now exposes a native Anthropic-style /v1/messages endpoint. This is the sleeper feature of the year: Anthropic-native tools — most notably Claude Code — can talk to a local model directly, with no proxy or translation layer. There's a matching OpenAI-compatible endpoint too, so Codex and OpenAI-SDK apps work the same way.
  • ollama launch (January). A single command that configures and starts a coding agent against a local model — ollama launch claude sets up Claude Code, prompts you to pick a model, and you're in.
  • Experimental image generation (January). Early days, but the scope of "local model" is no longer text-only.
  • MLX engine on Apple Silicon (March preview → June release). Ollama moved its Mac inference path to Apple's MLX framework, which exploits unified memory. Ollama's own framing for the June release: its highest performance on Apple Silicon yet — faster output with reduced memory usage.
  • Ollama 0.30 and 0.31 (June). Version 0.30 brought improved performance and broader GGUF model compatibility through llama.cpp; 0.31 made Gemma 4 significantly faster on Apple Silicon via multi-token prediction (MTP), enabled by default.

The theme is clear: Ollama is positioning itself less as a hobbyist toy and more as the standard local backend for agentic tooling.

Getting started in five minutes

Install (macOS and Windows have installers at ollama.com/download; on Linux):

curl -fsSL https://ollama.com/install.sh | sh

Pull and run a model:

ollama pull gemma4
ollama run gemma4

That's a working local chat. Ollama also starts a local server on port 11434, which is where the interesting part begins — every API-based tool you have can point at it.

Useful daily commands: ollama ls (installed models), ollama ps (what's loaded and where — CPU vs GPU), ollama rm <model> (free disk space; models are multi-gigabyte).
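Because the server is plain HTTP, the same information `ollama ls` prints is also available programmatically. The sketch below is a minimal example that assumes the default address on port 11434 and Ollama's `/api/tags` listing endpoint; the helper names `model_names` and `list_local_models` are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local server address


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask the running Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


if __name__ == "__main__":
    try:
        for name in list_local_models():
            print(name)
    except OSError as exc:  # connection refused, timeout, HTTP errors
        print(f"Ollama server not reachable at {OLLAMA_URL}: {exc}")
```

If the call fails, the server is probably not running yet; starting the desktop app or running any `ollama` command that needs the server usually fixes that.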

Hardware: the honest sizing guide

The rule of thumb that matters: a model quantized to 4 bits needs very roughly 0.5–0.7 GB of memory per billion parameters, plus overhead for context. Everything else follows from that.
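As a quick sanity check, the rule of thumb can be turned into a few lines of arithmetic. This is a rough sketch, not a sizing tool: the 0.5–0.7 GB range comes from the text above, while the 2 GB context allowance is an assumed placeholder you should adjust for your own workload.

```python
def weights_memory_gb(params_billion: float,
                      gb_per_billion: tuple[float, float] = (0.5, 0.7)) -> tuple[float, float]:
    """Low and high memory estimate for 4-bit weights, using the rule of thumb."""
    low, high = gb_per_billion
    return params_billion * low, params_billion * high


def fits(params_billion: float, memory_gb: float, context_overhead_gb: float = 2.0) -> bool:
    """Conservative check: upper weight estimate plus an assumed context allowance."""
    _, high = weights_memory_gb(params_billion)
    return high + context_overhead_gb <= memory_gb


if __name__ == "__main__":
    for size in (8, 14, 27):
        low, high = weights_memory_gb(size)
        print(f"{size}B: {low:.1f} to {high:.1f} GB, fits in 24 GB: {fits(size, 24)}")
```

A 27B model lands at roughly 13.5 to 18.9 GB of weights, which is why it fits on a 24 GB machine but not a 16 GB one. Keep in mind that the operating system and other apps also need memory, so the usable budget is smaller than the number on the box.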

| Your hardware | What runs comfortably | Experience |
| --- | --- | --- |
| 8 GB RAM, no GPU | 3–8B models, quantized | Fine for chat, drafting, classification. Slow but usable on CPU |
| 16 GB RAM (Apple Silicon) | 8–14B models | Good daily-driver territory; MLX made this tier notably faster in 2026 |
| 24 GB+ (M-series Pro/Max or a 24 GB GPU) | 27–35B models | Where local coding models get genuinely useful |
| 48 GB+ unified memory / multi-GPU | Large MoE models | Server-class local inference |

Two nuances that save people disappointment:

  • Quantization is why any of this works. Models ship in compressed 4–8 bit variants (the GGUF ecosystem) that trade a small quality loss for a 2–4× memory reduction relative to 16-bit weights. Ollama's default tags are already quantized, so you rarely need to think about it, but it explains why a "27B model" fits in 24 GB.
  • Mixture-of-experts (MoE) models need memory for their total parameters but compute like their active subset. NVIDIA's Nemotron-3-Super, for example, is a 120B model with 12B active parameters: it runs faster than its size suggests, but you still need the RAM to hold it.
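Both nuances come down to simple arithmetic, sketched below. The weight-size formula is just parameters times bits divided by eight, so it gives the raw figure; real quantized files are somewhat larger, which is why the rule of thumb above quotes a range. The compute estimate uses the common approximation of about 2 FLOPs per active parameter per generated token, and `ModelShape` is a hypothetical helper type, not anything from Ollama.

```python
from dataclasses import dataclass


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1e9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8


@dataclass(frozen=True)
class ModelShape:
    total_params_b: float   # billions of parameters that must sit in memory
    active_params_b: float  # billions used per token (equals total for dense models)


def approx_gflops_per_token(shape: ModelShape) -> float:
    """Rough forward-pass cost: about 2 FLOPs per active parameter per token."""
    return 2 * shape.active_params_b


if __name__ == "__main__":
    # Quantization: the same 27B model at 16, 8 and 4 bits per weight.
    for bits in (16, 8, 4):
        print(f"27B at {bits}-bit: {weights_gb(27, bits):.1f} GB")

    # MoE: memory follows total parameters, compute follows active parameters.
    moe = ModelShape(total_params_b=120, active_params_b=12)
    print(f"MoE weights at 4-bit: {weights_gb(moe.total_params_b, 4):.0f} GB")
    print(f"MoE compute per token: ~{approx_gflops_per_token(moe):.0f} GFLOPs")
```

For the 120B / 12B-active example from the text, that works out to about 60 GB of raw 4-bit weights but only the per-token compute of a 12B dense model.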

Context length eats memory too — an agent session with 32K+ tokens of context adds real overhead on top of the weights. If you're sizing for coding agents, budget for that, not just the model file.
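To get a feel for the size of that overhead, here is a back-of-the-envelope estimate of the key/value cache for a standard transformer. The layer count, KV-head count, and head dimension below are hypothetical values chosen for illustration, not the shape of any model named in this guide, and the sketch assumes a 16-bit cache with no sliding-window or cache-quantization tricks.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: two tensors (K and V) per layer,
    each holding n_tokens x n_kv_heads x head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, 32K tokens of context, 16-bit cache entries.
    size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_tokens=32 * 1024)
    print(f"KV cache at 32K context: {size / 1024**3:.1f} GiB")
```

With those assumed numbers the cache alone is 6 GiB at 32K tokens, on top of the weights. The cost grows linearly with context length, so halving the context halves the overhead.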

The mid-2026 open-weight lineup worth knowing

Rankings churn monthly, so treat this as a map, not a leaderboard. From Ollama's model library, the families that matter right now:

  • Gemma 4 (12B–31B) — Google's open family, currently the most-pulled model on Ollama. Multimodal, tuned for reasoning and agentic work, and the main beneficiary of the MLX/MTP speedups on Macs.
  • Qwen3.5 / Qwen3.6 (0.8B–122B) — the ecosystem's Swiss army knife. Qwen3.5 spans everything from edge-tiny to server-large; Qwen3.6 (27B–35B) focuses on agentic coding. qwen3-coder is Ollama's own recommendation for coding-agent use.
  • GLM-5 family — flagship-class open weights (GLM-5 is 744B total / 40B active); strong at coding and long-horizon tasks. Too big for most desktops locally, but available as :cloud variants (see below) and self-hostable on serious hardware.
  • Nemotron-3-Super (120B MoE, 12B active) — NVIDIA's entry, aimed at multi-agent applications.
  • MiniMax-M3 — notable for a 1M-token context window, if your workload is long-document analysis.
  • Specialists: GLM-OCR for document understanding, TranslateGemma (4B–27B, 55 languages) for translation, LFM2 (24B) for on-device deployment, Ornith (9B–35B) for agentic coding.

Sensible defaults: on a 16 GB Mac, start with gemma4:12b. On 24 GB+, try qwen3-coder for code and gemma4:27b for general work. Then run your tasks on them — a model's rank on someone's benchmark tells you little about your use case.

The part that changes your workflow: Ollama as a drop-in API

Ollama's server speaks both major API dialects, which means "switch to a local model" is now a base-URL change, not a rewrite.

OpenAI-compatible (/v1) — any OpenAI-SDK app works:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "Explain GGUF quantization in one paragraph."}],
)
print(resp.choices[0].message.content)
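One way to keep that base-URL switch out of application code is to resolve the connection settings from the environment. This is only a sketch: `LLM_BACKEND` is a made-up variable name for illustration, and the cloud branch assumes the standard OpenAI endpoint with a key in `OPENAI_API_KEY`.

```python
import os
from typing import Mapping, Optional

# Local Ollama settings; the API key is a placeholder, as in the example above.
LOCAL = {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}


def client_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick local or cloud connection settings from environment variables.

    LLM_BACKEND is a hypothetical switch; it defaults to "local".
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "local":
        return dict(LOCAL)
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": env["OPENAI_API_KEY"],  # raises KeyError if the key is missing
    }


if __name__ == "__main__":
    print(client_settings()["base_url"])
```

The result can be passed straight to the SDK, for example `OpenAI(**client_settings())`, so the rest of the code never needs to know which backend it is talking to. The model name still has to match the backend, so treat it as part of the same configuration.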

Anthropic-compatible (/v1/messages) — and this is the one with teeth, because it means Claude Code runs against local models. Per Ollama's official docs:

export ANTHROPIC_AUTH_TOKEN=ollama       # accepted but not validated
export ANTHROPIC_BASE_URL=http://localhost:11434

claude --model qwen3-coder

Or let Ollama do the wiring for you:

ollama launch claude
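The same endpoint can be called from your own code, not just from Claude Code. Below is a minimal standard-library sketch that posts an Anthropic-style message to the local server. The request and response shapes follow Anthropic's public Messages API; the header values, the `qwen3-coder` default, and the helper names are assumptions for illustration, and it is worth checking Ollama's compatibility docs for the exact fields it honors.

```python
import json
import urllib.request


def build_messages_request(prompt: str, model: str = "qwen3-coder",
                           base_url: str = "http://localhost:11434",
                           max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request in the Anthropic Messages format."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-api-key": "ollama",              # placeholder; not validated locally
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def extract_text(response: dict) -> str:
    """Join the text blocks of a Messages-style response."""
    return "".join(block["text"] for block in response.get("content", [])
                   if block.get("type") == "text")


if __name__ == "__main__":
    req = build_messages_request("Say hello in five words.")
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            print(extract_text(json.load(resp)))
    except OSError as exc:  # server down, model missing, timeout
        print(f"Request failed: {exc}")
```

Building the request separately from sending it keeps the payload easy to inspect and test without a running server.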

Honest caveats before you get excited:

  • Ollama recommends at least 32K tokens of context for Claude Code. Its model suggestions for coding are qwen3-coder locally (30B, so you want 24 GB+ of VRAM or unified memory) or glm-4.7:cloud / minimax-m2.1:cloud via Ollama's cloud, which keeps the same API surface but runs the weights remotely.
  • The compatibility layer doesn't cover everything: no prompt caching, no token-counting endpoint, no forced tool selection, no batches API, no PDF inputs (images must be base64). If your workflow leans on those, you'll feel it.
  • A 30B local model is not Opus, and it isn't trying to be. It's "capable pair of hands on an airplane / on confidential code," not "frontier reasoning."
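The base64 requirement for images in the caveats above is easy to satisfy in code. The sketch below builds an image content block in the shape used by Anthropic's Messages API; the helper names are hypothetical, and whether a given local model can actually interpret images depends on the model you pulled.

```python
import base64


def image_block(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Wrap raw image bytes as a base64 content block (Anthropic Messages shape)."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        },
    }


def user_message_with_image(text: str, image_bytes: bytes,
                            media_type: str = "image/png") -> dict:
    """A user turn that carries one image followed by a text instruction."""
    return {
        "role": "user",
        "content": [image_block(image_bytes, media_type),
                    {"type": "text", "text": text}],
    }


if __name__ == "__main__":
    # Tiny stand-in payload; in practice read the bytes from an image file.
    msg = user_message_with_image("Describe this image.", b"not-a-real-image")
    print(msg["content"][0]["source"]["media_type"])
```

In a real call, the resulting dictionary goes into the `messages` list of the request body in place of a plain-string message.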

The pattern that actually works in practice is routing: local models for the private, high-volume, or offline work; frontier cloud models for the hard reasoning. Deciding which tier a task belongs to — and building the escalation path — is an architecture skill, and it's exactly the kind of decision we drill in our AI system architecture course.
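A routing layer of this kind can start out very small. The sketch below is one possible shape, not a recommended architecture: the `Task` fields, the `choose_tier` rules, and the `good_enough` check are all hypothetical and would need to reflect your own constraints.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    prompt: str
    private: bool = False         # data must not leave the machine
    offline: bool = False         # no network available
    hard_reasoning: bool = False  # wrong answers are expensive


def choose_tier(task: Task) -> str:
    """Return "local" or "cloud". Privacy and offline constraints always win."""
    if task.private or task.offline:
        return "local"
    return "cloud" if task.hard_reasoning else "local"


def run(task: Task,
        local: Callable[[str], str],
        cloud: Callable[[str], str],
        good_enough: Callable[[str], bool]) -> str:
    """Run on the chosen tier, escalating weak local answers when allowed."""
    if choose_tier(task) == "cloud":
        return cloud(task.prompt)
    answer = local(task.prompt)
    can_escalate = not (task.private or task.offline)
    if can_escalate and not good_enough(answer):
        return cloud(task.prompt)
    return answer


if __name__ == "__main__":
    # Stand-in model functions; replace with real local and cloud clients.
    result = run(Task("Summarize this ticket."),
                 local=lambda p: "local summary",
                 cloud=lambda p: "cloud summary",
                 good_enough=lambda a: len(a) > 0)
    print(result)
```

The important property is that a private task can never reach the cloud function, even when the local answer fails the quality check.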

When local is the wrong choice

Being a fan of local inference means knowing where it loses:

  • Frontier-quality reasoning. For the hardest tasks, top cloud models remain clearly ahead of anything you can run on a workstation. If wrong answers are expensive, don't fight this.
  • Low-volume workloads. If you make a few thousand API calls a month, per-token billing is cheaper than any GPU. Local pays off at volume, at privacy constraints, or at both.
  • Ops you don't want. A self-hosted model is a service you now run: updates, monitoring, capacity. ollama run on a laptop is trivial; a team-wide inference server is real infrastructure.
  • Multimodal breadth and long-tail capabilities. Cloud APIs still bundle more (native PDF understanding, larger tool ecosystems, batch APIs) than the local stack replicates.
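The low-volume point in the list above is easy to check with a break-even estimate. Every number in this sketch is a made-up placeholder (hardware price, token volume, per-token price, running cost), so plug in your own figures before drawing conclusions.

```python
from typing import Optional


def monthly_cloud_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Per-token billing: tokens divided by one million, times the price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def breakeven_months(hardware_usd: float, tokens_per_month: float,
                     usd_per_million_tokens: float,
                     local_running_usd_per_month: float = 0.0) -> Optional[float]:
    """Months until the hardware pays for itself, or None if it never does."""
    saving = (monthly_cloud_cost(tokens_per_month, usd_per_million_tokens)
              - local_running_usd_per_month)
    if saving <= 0:
        return None
    return hardware_usd / saving


if __name__ == "__main__":
    # Hypothetical high-volume case: 50M tokens a month at $3 per million.
    print(breakeven_months(1500, 50_000_000, 3.0))
    # Hypothetical low-volume case: 2M tokens a month, $10 a month to run locally.
    print(breakeven_months(1500, 2_000_000, 3.0, local_running_usd_per_month=10.0))
```

Under these assumed numbers the high-volume case pays off in 10 months, while the low-volume case never does, because the monthly cloud bill is smaller than the cost of keeping the local machine running.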

One more thing people conflate: running a model locally is different from customizing one. If your actual goal is a model that speaks your domain language or follows your house style, that's a fine-tuning question — LoRA adapters on an open-weight base, then serving the result through Ollama. That pipeline (when to fine-tune vs when to just engineer the prompt) is its own discipline, covered in our fine-tuning course.

Frequently asked questions

Is Ollama free?

The tool itself is open source and free. The models each carry their own licenses — Gemma, Qwen, GLM and friends have different terms, some with restrictions on commercial use. Check the license tab on the model's Ollama page before you ship a product on it. Ollama's optional cloud models are a paid service.

What hardware do I need to run LLMs locally in 2026?

As a rule of thumb at 4-bit quantization: 8 GB of RAM runs 3–8B models, 16 GB runs 8–14B comfortably (especially on Apple Silicon with the MLX engine), and 24 GB+ opens up the 27–35B class where local coding models get genuinely useful. More context = more memory on top of the weights.

Can a local model replace GPT or Claude?

For a growing set of tasks — summarization, classification, drafting, routine coding on mid-size codebases — yes, credibly. For frontier reasoning and the highest-stakes accuracy, no. Production teams typically route: local for private/high-volume work, cloud for the hard 10%.

Can I really use Claude Code with Ollama?

Yes. Since Ollama v0.14.0 (January 2026) there's native Anthropic Messages API compatibility: set ANTHROPIC_BASE_URL=http://localhost:11434 and a placeholder ANTHROPIC_AUTH_TOKEN=ollama, then run claude --model qwen3-coder, or just use ollama launch claude. Expect a capable assistant, not Opus-level reasoning, and note that prompt caching and a few other API features aren't supported through the compatibility layer.

Ollama vs llama.cpp vs vLLM — which should I use?

Ollama for developer experience: one command, model management, dual API compatibility. llama.cpp (which powers Ollama's GGUF path) for maximum control and minimal footprint. vLLM for high-throughput multi-user serving on server GPUs. Most developers should start with Ollama and only move down the stack when they hit a concrete limit.

The skill underneath the tool

Here's the uncomfortable part: pulling a model is the easy 5%. The value shows up when you can answer the questions around it — which model for which task, how to measure whether the local model is good enough for your workload instead of guessing, how to build the routing and fallback so privacy-sensitive work stays local while hard problems escalate to a frontier model. That's engineering judgment, not tooling trivia.
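Measuring instead of guessing can start with a tiny harness like the one below. It is a deliberately simple sketch: `generate` is any function that maps a prompt to a model answer (a hypothetical wrapper around your local or cloud client), and exact-match scoring only suits tasks with short, unambiguous answers such as classification.

```python
from typing import Callable, Iterable


def accuracy(generate: Callable[[str], str],
             examples: Iterable[tuple[str, str]]) -> float:
    """Share of examples where the model output matches the expected label.

    Comparison ignores case and surrounding whitespace.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one labeled example")
    hits = sum(
        1 for prompt, expected in examples
        if generate(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(examples)


if __name__ == "__main__":
    # Stand-in "model" that labels everything as positive.
    labeled = [("Great product!", "positive"), ("Terrible support.", "negative")]
    print(accuracy(lambda prompt: "Positive", labeled))
```

Running the same labeled set through a local model and a cloud model gives you two comparable numbers, which is a much firmer basis for a routing decision than a public leaderboard.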

That judgment is what we teach at our AI education platform — hands-on courses built around real repositories and an interactive AI instructor, covering the full local-to-cloud spectrum: self-hosting and privacy, fine-tuning, architecture, and the agentic workflow on top.

Conclusion

In 2026, local LLMs crossed the line from hobby to infrastructure option. Ollama's dual API compatibility means your existing tools — including Claude Code — can run against open weights with a base-URL change; the MLX engine made a 16 GB MacBook a legitimate inference machine; and the open-weight lineup in the 12–35B range is good enough for a real slice of production work.

The play isn't "cancel your API keys." It's knowing which slice of your workload belongs on weights you control — then running it there deliberately, measured, with an escalation path for everything else. Start with ollama run gemma4 tonight; you're one evening away from having an informed opinion.


Written by the team at Cursuri-AI.ro — practical, hands-on AI engineering courses for developers and professionals across Eastern Europe, from local LLMs and self-hosting to agentic coding, evals, and AI system architecture.

Sources: Ollama Blog · Anthropic API compatibility — Ollama Docs · Claude Code with Anthropic API compatibility — Ollama · Ollama Model Library · Download Ollama
