Getting Started with Ollama: Run LLMs Locally in 10 Minutes

Mohit Kumar — Sun, 28 Jun 2026 01:18:12 +0000

If you've ever wanted to run a large language model on your own machine — no API key, no cloud bill, no data leaving your laptop — Ollama is the easiest way to get there. It packages model weights, a runtime (built on llama.cpp), and a simple CLI/REST API into one tool that works the same way on macOS, Linux, and Windows.

This guide covers installation, running your first model, the core commands you'll actually use, picking a model for your hardware, and hooking Ollama into your own code via its API.

Why run models locally?

Privacy — your prompts and data never leave your machine.
Cost — no per-token billing. You pay once, in hardware (or nothing, if you already have a decent laptop).
Offline — works on a plane, in a SCIF, or wherever your Wi-Fi doesn't.
Control — swap models, tweak parameters, fine-tune behavior with no rate limits.

The tradeoff: local models are generally smaller and slightly behind frontier cloud models (GPT, Claude, Gemini) on raw capability — though the gap keeps shrinking fast.

Installation

macOS

Download the app from ollama.com/download, or use Homebrew:

brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

This installs the ollama binary and sets up a systemd service so it runs in the background. Check it's alive:

systemctl status ollama

Windows

Download OllamaSetup.exe from ollama.com/download and run it — no admin rights required. Recent versions ship a full desktop app with a chat window, so you can skip the terminal entirely if you prefer. A native ARM64 build is also available for Windows-on-Arm devices.

Docker

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Add --gpus=all if you have an NVIDIA GPU and the NVIDIA Container Toolkit installed.

Verify it's working

ollama --version
ollama list

An empty list is expected on a fresh install — it just confirms the daemon is up and responding.

Run your first model

ollama run llama3.2

This pulls the model (a few GB, one-time download) and drops you into an interactive chat session. Type a prompt, hit enter, get a response. Ctrl+D or /bye exits.

Core commands cheat sheet

Command	What it does
`ollama run <model>`	Pull (if needed) and chat with a model
`ollama pull <model>`	Download a model without starting a chat
`ollama list`	Show models you have installed
`ollama ps`	Show models currently loaded in memory
`ollama show <model>`	Show details/parameters for a model
`ollama rm <model>`	Delete a model to free disk space
`ollama stop <model>`	Unload a model from memory
`ollama create <name> -f Modelfile`	Build a custom model from a Modelfile

Always pull with an explicit tag for anything you depend on (ollama pull qwen2.5-coder:7b), since :latest can change under you.

Picking a model for your hardware

Ollama's library has hundreds of models. As a starting point:

Use case	Try	Rough RAM/VRAM
General daily driver, light hardware	`llama3.2:3b`	~4 GB
General daily driver, mid hardware	`llama3.1:8b` or `qwen3:8b`	~6–8 GB
Coding	`qwen2.5-coder:7b` or `qwen3-coder:30b` (MoE, runs lighter than its size suggests)	6–20 GB
Reasoning / math / step-by-step logic	`deepseek-r1:7b` or `:14b`	6–12 GB
Best quality you can fit on a single consumer GPU	`qwen3.6:27b` or `gpt-oss:20b`	~16–24 GB
Vision (images + text)	`llava` or `gemma3:12b`	8–16 GB
Embeddings (for RAG / semantic search)	`nomic-embed-text`	<1 GB

Rule of thumb for sizing: a 7–8B model at Q4 quantization needs roughly 5–6 GB of memory. Treat that as a ballpark figure, not a guarantee. Mixture-of-experts models (the ones with an "active/total" parameter split, like qwen3-coder:30b) activate only a fraction of their parameters for each token, so they often run faster than their parameter count implies, but they still need the full model in memory, not just the active slice. Always check ollama.com/library for the current tag list, since the model lineup changes frequently.
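The sizing rule of thumb can be turned into a quick back-of-the-envelope calculation. The sketch below is only an estimate, not how Ollama itself accounts for memory: it assumes about 4.5 bits per weight for a Q4-style quantization and a flat 1 GB allowance for runtime overhead and a short context. Both numbers are illustrative assumptions, and `estimate_memory_gb` is a hypothetical helper.

```python
def estimate_memory_gb(params_billions: float,
                       bits_per_weight: float = 4.5,
                       overhead_gb: float = 1.0) -> float:
    """Rough memory estimate for a quantized model, in GB (10^9 bytes).

    bits_per_weight and overhead_gb are illustrative assumptions,
    not values reported by Ollama.
    """
    # billions of params * bits per weight / 8 bits per byte = GB of weights
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb


if __name__ == "__main__":
    # For a mixture-of-experts model, pass the TOTAL parameter count,
    # since the whole model has to sit in memory.
    for size in (3, 8, 14, 30):
        print(f"{size}B params: ~{estimate_memory_gb(size):.1f} GB")
```

Under these assumptions an 8B model comes out at 5.5 GB, which lines up with the 5–6 GB figure. Longer contexts push the real number higher.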

If you're not sure where to start: pull a small model, use it for a week on your actual tasks, and let what it struggles with point you toward the next one.

Using the API

Ollama exposes a REST API on localhost:11434 — this is how every IDE plugin, chat UI, and framework talks to it under the hood.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Explain Ollama in one sentence." }],
  "stream": false
}'

It also exposes an OpenAI-compatible endpoint, so anything built for the OpenAI SDK can point at Ollama with a base URL change:

http://localhost:11434/v1/chat/completions

Python

pip install ollama

from ollama import chat

response = chat(model='llama3.2', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
])
print(response.message.content)
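The OpenAI-compatible endpoint mentioned earlier can also be called with nothing but the standard library, which is handy when you don't want to add a dependency. This is a minimal sketch: `build_chat_request` and `ask` are hypothetical helper names, and it assumes Ollama is listening on the default localhost:11434 with llama3.2 already pulled.

```python
import json
import urllib.request

# Default local address; change it if your server runs elsewhere.
OLLAMA_BASE_URL = "http://localhost:11434/v1"


def build_chat_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{OLLAMA_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.2") -> str:
    """Send the request and return the assistant's reply text.

    Needs a running Ollama server with the model available locally.
    """
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    # OpenAI-style responses put the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Why is the sky blue?")` should return the same kind of answer as the `ollama` package example, just through the OpenAI-style response shape.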

Customizing a model with a Modelfile

Want a model with a fixed system prompt or different default parameters? Create a Modelfile:

FROM llama3.2

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM """
You are a terse code reviewer. Point out bugs and style issues only — no praise, no fluff.
"""

Build it:

ollama create code-reviewer -f Modelfile
ollama run code-reviewer

Now code-reviewer is its own model in ollama list, with your settings baked in.

A few practical tips

Bind address: by default Ollama only listens on 127.0.0.1. Setting OLLAMA_HOST=0.0.0.0 exposes the API to your whole network with no authentication — fine on a trusted LAN, risky anywhere else.
Multiple models loaded at once: OLLAMA_MAX_LOADED_MODELS caps how many models stay in memory at the same time, and OLLAMA_NUM_PARALLEL sets how many requests each loaded model handles in parallel.
Long contexts are expensive: KV cache memory scales with context length, not just model size. A 70B model at 128K context can add tens of GB beyond the weights alone. Set num_ctx deliberately in a Modelfile instead of leaving it at whatever default your VRAM tier triggers.
GPU not being used? Check ollama ps — it shows whether a model is running on CPU or GPU. Driver issues (CUDA/ROCm) are the most common cause of silent CPU fallback.
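To put a number on the long-context tip, here is a rough KV cache estimate. It is a sketch under stated assumptions: an fp16 cache (2 bytes per value) and a Llama-3-style 70B shape with 80 layers, 8 KV heads, and a head dimension of 128. Those architecture numbers are assumptions for illustration, so check the model card for the model you actually run. `kv_cache_gib` is a hypothetical helper.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB for a single sequence.

    Each token stores one key and one value vector per layer, each of
    size n_kv_heads * head_dim. bytes_per_value=2 assumes an fp16 cache.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 2**30


if __name__ == "__main__":
    # Assumed shape: 80 layers, 8 KV heads, head_dim 128 (illustrative).
    for ctx in (4096, 32768, 131072):
        print(f"{ctx:>7} tokens: {kv_cache_gib(80, 8, 128, ctx):.2f} GiB")
```

With these assumptions the cache is 1.25 GiB at a 4,096-token context and 40 GiB at 131,072 tokens, which is where the "tens of GB" comes from. A quantized KV cache or a smaller num_ctx shrinks it proportionally.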

Where to go next

Browse ollama.com/library for the full, constantly-updated model list.
Point any OpenAI-SDK-based tool (LangChain, LlamaIndex, Continue, etc.) at http://localhost:11434/v1 to swap in local models with minimal code changes.
Pair a small embedding model (nomic-embed-text) with a chat model to build a local RAG pipeline with zero API cost.

That's the whole loop: install, pull, run, integrate. Everything else is just picking the right model for the job.

Building a Reliable Webhook Delivery System: What Actually Broke and How I Fixed It

Mohit Kumar — Tue, 23 Jun 2026 05:24:48 +0000

Webhooks seem simple until a worker crashes mid-delivery, a subscriber goes down for an hour, or a payload gets tampered with in transit.

Here's what I actually built to handle that — FastAPI + PostgreSQL + Redis.

The core problems I solved:

1. Synchronous delivery blocks everything
The naive approach calls the subscriber URL inline, so one slow endpoint stalls your whole ingest path. Fix: return 202 Accepted immediately, persist the event, deliver async.

2. Workers crash and jobs disappear
If a worker dies mid-delivery, that job is stuck IN_FLIGHT forever. Fix: a watchdog sweeping every 30s, requeuing anything stale.
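The watchdog idea can be sketched in a few lines. This is an in-memory illustration, not the actual implementation: the real system would run the equivalent query against PostgreSQL. The `Delivery` class, the `claimed_at` field, and the 120-second staleness threshold are all assumptions, since the post only gives the 30s sweep interval.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold; only the sweep interval (30s) comes from the post.
STALE_AFTER_SECONDS = 120.0


@dataclass
class Delivery:
    id: str
    status: str = "QUEUED"              # QUEUED -> IN_FLIGHT -> DELIVERED
    claimed_at: Optional[float] = None  # when a worker picked the job up


def sweep_stale(deliveries: list[Delivery], now: float,
                stale_after: float = STALE_AFTER_SECONDS) -> list[str]:
    """Requeue deliveries stuck IN_FLIGHT longer than stale_after seconds."""
    requeued = []
    for d in deliveries:
        stuck = (d.status == "IN_FLIGHT"
                 and d.claimed_at is not None
                 and now - d.claimed_at > stale_after)
        if stuck:
            d.status = "QUEUED"
            d.claimed_at = None
            requeued.append(d.id)
    return requeued
```

One design consequence: a worker that is slow but still alive can finish a job after the watchdog has requeued it, so delivery becomes at-least-once and subscribers should be able to handle duplicates.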

3. Retries without backoff make things worse
Hammering a struggling subscriber on failure makes recovery harder. Fix: exponential backoff (2s → 32s, max 5 attempts) using a Redis sorted set as a delay queue — score = next attempt timestamp.
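Here is a minimal sketch of the backoff schedule and the delay queue. It uses a heap as an in-memory stand-in for the Redis sorted set, so it runs without a Redis server. In Redis, `schedule` would roughly correspond to ZADD and `pop_due` to a ZRANGEBYSCORE followed by ZREM. I'm reading "max 5 attempts" as five retries with delays of 2, 4, 8, 16, and 32 seconds, which is an assumption. `DelayQueue` and `schedule_retry` are hypothetical names.

```python
import heapq

BASE_DELAY_SECONDS = 2
MAX_RETRIES = 5  # assumed reading of "max 5 attempts"


def backoff_delay(retry_number: int) -> int:
    """Delay before retry `retry_number` (1-based): 2, 4, 8, 16, 32."""
    return BASE_DELAY_SECONDS * 2 ** (retry_number - 1)


class DelayQueue:
    """In-memory stand-in for a sorted set scored by next-attempt time."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []

    def schedule(self, job_id: str, run_at: float) -> None:
        heapq.heappush(self._heap, (run_at, job_id))

    def pop_due(self, now: float) -> list[str]:
        """Remove and return every job whose time has come, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due


def schedule_retry(queue: DelayQueue, job_id: str,
                   retry_number: int, now: float) -> bool:
    """Queue the given retry; return False once retries are exhausted."""
    if retry_number > MAX_RETRIES:
        return False
    queue.schedule(job_id, now + backoff_delay(retry_number))
    return True
```

A worker loop would call `pop_due(time.time())` on a short interval and hand the returned job IDs back to the delivery workers. Adding a little random jitter to each delay is a common extension that keeps retries from lining up.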

4. One dead subscriber degrades the whole system
Fix: circuit breaker per subscription. 5 consecutive failures trips it OPEN. After 60s cooldown, one probe tests recovery before resuming.
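The breaker described above is a small state machine. The sketch below is one way to write it, assuming three states (CLOSED, OPEN, HALF_OPEN) and an injected `now` timestamp so it can be tested without sleeping. The class and method names are hypothetical, and the real system would keep this state per subscription in shared storage rather than in process memory.

```python
class CircuitBreaker:
    """Per-subscription breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 5,
                 cooldown_seconds: float = 60.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.state = "CLOSED"
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def allow_request(self, now: float) -> bool:
        """Return True if a delivery may be attempted at time `now`."""
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown_seconds:
            self.state = "HALF_OPEN"  # let exactly one probe through
            return True
        return False  # still cooling down, or a probe is already in flight

    def record_success(self) -> None:
        """A success (including a successful probe) closes the breaker."""
        self.state = "CLOSED"
        self.consecutive_failures = 0

    def record_failure(self, now: float) -> None:
        """Trip on the Nth consecutive failure, or on a failed probe."""
        self.consecutive_failures += 1
        if (self.state == "HALF_OPEN"
                or self.consecutive_failures >= self.failure_threshold):
            self.state = "OPEN"
            self.opened_at = now
```

Deliveries that are refused while the breaker is OPEN can be pushed back onto the delay queue instead of being dropped, so nothing is lost while the subscriber recovers.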

5. No payload integrity
Fix: per-subscription HMAC-SHA256 signature on every payload, verified with hmac.compare_digest to guard against timing attacks.
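Signing and verifying fit in a few lines of standard-library Python. This is a minimal sketch, assuming the signature is sent as a hex string and computed over the raw request body. The function names are hypothetical, and so is any header you'd carry the signature in.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_payload(secret, body)
    # Compare as bytes so a non-ASCII signature fails cleanly
    # instead of raising TypeError.
    return hmac.compare_digest(expected.encode("utf-8"),
                               received_signature.encode("utf-8"))
```

The sender calls `sign_payload` with the subscription's secret, and the subscriber calls `verify_signature` on the exact bytes it received. Verifying a re-serialized JSON object instead of the raw body is a common source of false mismatches.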

Result: 99.9% delivery reliability across 10,000+ daily webhooks, with full visibility via Prometheus + Grafana.

Full deep-dive coming soon.

DEV Community: Mohit Kumar