
Rob

Posted on • Originally published at vibescoder.dev

Putting the GPU to Work: Running Local LLMs on a Home Lab

Yesterday we went from a gaming PC on a shelf to a fully configured Coder server with GitHub integration, workspace templates, and AI agents. The dev environment is running. But the RTX 5090's 32 GB of VRAM has been sitting idle, and all the AI work is still going through cloud APIs.

Today, we change that. This session was about installing Ollama, choosing the right models for different coding tasks, getting local inference running on the workstation, and then wiring it all into Coder Agents so local models show up right alongside Anthropic in the model selector. Everything here was done conversationally through Coder Agents, same as always.

Why VRAM Is the Only Spec That Matters

Before pulling any models, it helps to understand the constraint you're optimizing around. For local LLMs, that constraint is VRAM. Not CPU cores, not system RAM, not disk speed. VRAM determines what models you can run, and model size determines how useful they are.

VRAM              What You Can Run
8-12 GB           7B models (Qwen3:8b, DeepSeek-Coder 6.7B)
16 GB             14B-20B models (DeepSeek R1 14B, Codestral 25.12)
24-32 GB          27B-35B models, the sweet spot for agentic coding
32 GB+ / unified  70B quantized, Qwen3-Coder-Next

The 32 GB on the RTX 5090 lands squarely in the sweet spot. We can run 35B-parameter models at full quality, which is where the current generation of agentic coding models lives. The 64 GB of system RAM provides headroom for KV cache spillover when context windows get long, and the 2 TB NVMe means models load fast and we can store a whole library of them.
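That table falls out of a simple rule of thumb: a 4-bit-quantized model needs roughly half a byte per parameter, plus a few GB of headroom for KV cache and runtime buffers. Here's a back-of-the-envelope sketch; the 0.55 bytes/param and 2 GB overhead figures are rough assumptions, not measurements:

```python
def approx_vram_gb(params_b: float, bytes_per_param: float = 0.55,
                   overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for a quantized model.

    params_b        -- parameter count in billions
    bytes_per_param -- ~0.55 for 4-bit quantization; ~2.0 for fp16
    overhead_gb     -- allowance for KV cache and runtime buffers
    """
    return params_b * bytes_per_param + overhead_gb

for size_b in (7, 14, 35, 70):
    print(f"{size_b}B -> ~{approx_vram_gb(size_b):.0f} GB")
```

By this estimate a 35B model at 4-bit lands around 21 GB, consistent with the ~23 GB qwen3.5:35b-a3b image fitting in 32 GB of VRAM, while a 70B crosses the 32 GB line, which is why that tier needs heavier quantization or unified memory.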

Only one generation model loads into VRAM at a time. Ollama automatically unloads the previous model when you switch. The embedding model is small enough to coexist with any of them.

The Setup Script

Rather than running commands one at a time, we built a single bash script that handles the entire setup in six phases: hardware verification, Ollama install, service configuration, model pulls, verification, and a connection reference card.

The script is tailored to this exact hardware profile but the structure works for any NVIDIA GPU setup. It supports --models-only (already have Ollama, just pull models) and --verify-only (check that everything is working) flags for re-runs.

#!/usr/bin/env bash
# Target: AMD Ryzen 9 9950X3D / RTX 5090 32GB / 64GB DDR5 / Ubuntu 24.04 LTS
set -euo pipefail

PRIMARY_MODEL="qwen3.5:35b-a3b"
CODING_MODEL="devstral-small:24b"
REASONING_MODEL="deepseek-r1:14b"
AUTOCOMPLETE_MODEL="codestral:22b"
EMBEDDING_MODEL="nomic-embed-text"
KEEP_ALIVE="30m"
OLLAMA_HOST="127.0.0.1"
OLLAMA_PORT="11434"

# Phase 1: Verify hardware & drivers
nvidia-smi --query-gpu=driver_version,name,memory.total \
  --format=csv,noheader

# Phase 2: Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Phase 3: Configure service (keep models loaded 30min)
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<EOF
[Service]
Environment="OLLAMA_KEEP_ALIVE=${KEEP_ALIVE}"
Environment="OLLAMA_HOST=${OLLAMA_HOST}:${OLLAMA_PORT}"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama

# Phase 4: Pull models
for model in "$PRIMARY_MODEL" "$CODING_MODEL" "$REASONING_MODEL" \
             "$AUTOCOMPLETE_MODEL" "$EMBEDDING_MODEL"; do
  ollama pull "$model"
done

# Phase 5: Verify
ollama run "$PRIMARY_MODEL" "What is 2+2? Reply with just the number."
curl -sf "http://${OLLAMA_HOST}:${OLLAMA_PORT}/v1/models"

# Phase 6: Connection reference
echo "Ollama API:    http://${OLLAMA_HOST}:${OLLAMA_PORT}"
echo "OpenAI API:    http://${OLLAMA_HOST}:${OLLAMA_PORT}/v1"
echo "Primary Model: ${PRIMARY_MODEL}"

The real script is ~330 lines with color output, error handling, idempotent checks, and flag parsing. This is the condensed version showing the actual work.

What Happened When We Ran It

Phase 1: Hardware Detection

All green:

[OK]  NVIDIA driver: 590.48.01
[OK]  GPU: NVIDIA GeForce RTX 5090 (32607 MiB)
[OK]  Driver 590.48.01 meets minimum requirement (550+).
[OK]  System RAM: 60 GB
[OK]  Available disk: 1717 GB

System RAM reports 60 GB instead of 64 GB. Normal; the kernel and firmware reserve some. Not a problem.

Phase 2: Ollama Install

One curl command. Ollama v0.21.0 installed cleanly, auto-detected the NVIDIA GPU, created a systemd service, and added the user to render/video groups.

Phase 3: Service Configuration

The important piece here is KEEP_ALIVE=30m. Without it, Ollama unloads models from VRAM after 5 minutes of inactivity. Loading a 23 GB model back into memory takes time, and if you're switching between coding and chatting every few minutes, you're hitting cold starts constantly. Thirty minutes keeps things warm during a real work session.
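The systemd override sets the server-wide default, but Ollama's native API also accepts a per-request keep_alive field, which is handy when a single script wants to pin one model warm. A minimal sketch of the request body (the prompt is illustrative; per the Ollama API docs, keep_alive takes a duration string or -1 for indefinite):

```python
import json

def generate_payload(model: str, prompt: str, keep_alive="30m") -> dict:
    """Body for Ollama's native /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,           # return one JSON object, not a stream
        "keep_alive": keep_alive,  # "30m", "1h", or -1 to pin indefinitely
    }

# POST this to http://127.0.0.1:11434/api/generate
print(json.dumps(generate_payload("qwen3.5:35b-a3b", "What is 2+2?",
                                  keep_alive=-1)))
```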

Phase 4: Model Downloads

~44 GB pulled. One failure:

Model               Size    Status  Notes
qwen3.5:35b-a3b     23 GB   OK      Primary agentic coder. MoE, only 3B params active per token.
devstral-small:24b  -       FAILED  Registry name wrong.
deepseek-r1:14b     9.0 GB  OK      Chain-of-thought reasoning.
codestral:22b       12 GB   OK      Fast autocomplete for IDE tab-completion.
nomic-embed-text    274 MB  OK      Embedding model for codebase search.

devstral-small:24b doesn't exist on Ollama's registry. The correct pull is ollama pull devstral. Registry names don't always match what blogs and guides reference. This is the kind of thing you only learn by running it.

Phase 5: Verification

The automated inference test returned empty output. It was a cold-start timing issue: the bash $() capture returned before the model had finished loading 23 GB into VRAM. Manual verification worked immediately after:

$ ollama run qwen3.5:35b-a3b "What is 2+2? Reply with just the number."
4
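A scripted check can sidestep that race by polling until the model actually answers, rather than capturing output once. A sketch of the retry loop, with the probe left injectable; the commented ollama invocation is illustrative:

```python
import time

def wait_until(probe, timeout=300.0, interval=2.0):
    """Poll probe() until it returns truthy; False on timeout.

    probe is any zero-argument callable, e.g. one that shells out to
    `ollama run ...` or GETs /v1/models and checks the response.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

# Hypothetical usage against a cold-starting Ollama instance:
# import subprocess
# ready = wait_until(lambda: "4" in subprocess.run(
#     ["ollama", "run", "qwen3.5:35b-a3b", "What is 2+2?"],
#     capture_output=True, text=True, timeout=600).stdout)
```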

The OpenAI-compatible API endpoint was confirmed working at http://127.0.0.1:11434/v1/models.

Why These Models

Every model in the stack was chosen for a specific job. This isn't a "download the biggest model that fits" strategy. Different tasks have different requirements, and the right model for autocomplete is not the right model for debugging a race condition.

Primary: qwen3.5:35b-a3b is the all-rounder. Best agentic coder available in April 2026 at this VRAM tier. Mixture-of-Experts architecture means only 3B parameters are active per token despite being a 35B model. That gives you big-model quality with small-model speed. 256K context window. Strong tool-calling support. Fits comfortably in 32 GB VRAM at ~22 GB.

Coding: devstral (Mistral's agentic coding model) is trained specifically for multi-file edits, terminal automation, and code repair. Benchmarks highest on Ollama for pure coding tasks. When you need raw code generation without the overhead of reasoning chains, this is the one.

Reasoning: deepseek-r1:14b is the chain-of-thought model. It thinks before answering. Slower, but catches bugs other models miss. At 14B it only needs ~12 GB VRAM, so it loads fast and leaves headroom.

Autocomplete: codestral:22b is optimized for fast inline code completion (fill-in-the-middle). Best fit for IDE tab-complete via Continue.dev. You want this model to be fast above all else.

Embeddings: nomic-embed-text is a lightweight (274 MB) embedding model for codebase search and RAG pipelines. Small enough to run alongside any generation model without VRAM pressure.
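As a sketch of what the embedding model enables: once each file (or chunk) has a vector from nomic-embed-text, codebase search reduces to cosine similarity against the query's vector. The vectors below are stand-ins; in practice they would come from Ollama's /api/embeddings endpoint:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, corpus, k=3):
    """Rank (doc_id, vector) pairs by similarity to a query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```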

Wiring It Into Dev Tools

With Ollama running, everything that speaks the OpenAI API format can connect to it:

Ollama API:    http://127.0.0.1:11434
OpenAI API:    http://127.0.0.1:11434/v1
API Key:       ollama  (placeholder, not validated)
Primary Model: qwen3.5:35b-a3b

Continue.dev (VS Code)

name: Local Coder
version: 1.0.0
schema: v1
models:
  - name: Qwen3.5 35B (Chat/Edit)
    provider: ollama
    model: qwen3.5:35b-a3b
    roles: [chat, edit, apply]
  - name: Devstral (Coding)
    provider: ollama
    model: devstral-small:24b
    roles: [chat, edit]
  - name: Codestral (Autocomplete)
    provider: ollama
    model: codestral:22b
    roles: [autocomplete]
  - name: Nomic Embed
    provider: ollama
    model: nomic-embed-text
    roles: [embed]
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: codebase

Environment Variables

For scripts and agents that use the OpenAI client format:

export OLLAMA_HOST=http://127.0.0.1:11434
export OPENAI_API_BASE=http://127.0.0.1:11434/v1
export OPENAI_API_KEY=ollama
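With those variables set, any OpenAI-format client can target Ollama. A stdlib-only sketch of the request such clients send, no openai package required (the model name mirrors the primary model above):

```python
import json
import os
import urllib.request

def chat_request(prompt: str, model: str = "qwen3.5:35b-a3b"):
    """Build an OpenAI-style chat completion request aimed at Ollama."""
    base = os.environ.get("OPENAI_API_BASE", "http://127.0.0.1:11434/v1")
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key, but OpenAI clients send the header.
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', 'ollama')}",
        },
    )

# resp = urllib.request.urlopen(chat_request("What is 2+2?"))
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```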

Connecting to Coder Agents

This is the real payoff. Ollama is running, the models are loaded, and the OpenAI-compatible API is live on localhost. Now we wire it into Coder Agents so local models appear as selectable options right alongside the cloud providers.

Coder Agents runs the LLM loop in the control plane, not inside workspaces. That means the Coder server process makes the API calls directly. Since Ollama and the Coder server are running on the same machine, this is just pointing one localhost process at another. No tunnels, no port forwarding, no API keys leaving the box.

Step 1: Add the Provider

In the Coder dashboard, navigate to Agents > Admin > Providers and select OpenAI Compatible. Coder treats any endpoint that implements the OpenAI chat completions API as a first-class provider.

Set the Base URL to http://127.0.0.1:11434/v1 and enter ollama as the API key. Ollama doesn't validate keys, but Coder requires one, so this is a placeholder.

OpenAI Compatible provider configuration in the Coder admin panel, with the base URL set to the local Ollama endpoint

For Key policy, keep the defaults: Central API key on, user API keys off. There's no reason for individual developers to bring their own key to a local Ollama instance. Everyone hits the same GPU.

Step 2: Add Models

Switch to the Models tab and add each model you want available in the Agents chat. The Model Identifier must match exactly what Ollama expects, because that string is sent directly to the /v1/chat/completions endpoint.

Adding the Qwen 3.5B model with its identifier and context limit

We added two models to start:

Model Identifier  Display Name  Context Limit
qwen3:35b-a3b     Qwen 3.5B     32,768
devstral          Devstral      131,072

The Cost Tracking, Provider Configuration, and Advanced sections can all be skipped for local models. No token pricing to track (it's your own GPU), and the default generation parameters work fine.

Step 3: Use It

That's it. The models now appear in the Agents model selector dropdown alongside the existing Anthropic models. Pick one, start a conversation, and the entire inference loop runs on the local GPU.

Devstral running a task in Coder Agents, fully local inference

What Surprised Us

They work. Not "sort of work" or "work for simple prompts." The local models handle real agentic tasks through Coder Agents: reading files, running shell commands, editing code across multiple files, and reasoning about the results. Devstral in particular was impressive for code-focused work.

The latency difference compared to cloud providers is noticeable but not a dealbreaker. First-token time is slower because the model is running on a single consumer GPU rather than a cluster, but once inference is rolling, the throughput is solid. For the kind of iterative coding tasks Coder Agents handles, the tradeoff is worth it: zero API costs, zero data leaving your network, and no rate limits.

The practical recommendation: keep your cloud provider (Anthropic, OpenAI, whatever you're already using) as the default for complex, multi-step tasks. Use the local models for focused coding work, experimentation, and anything where you want to iterate fast without watching a billing dashboard.

Ollama vs vLLM: When to Scale Up

We chose Ollama because this is a single-developer workstation. Ollama wins on simplicity, resource efficiency, and single-user performance. One curl to install, one command to pull models, and it just works.

The tradeoff: if you later need to serve multiple concurrent Coder workspaces (5+ users hitting the same GPU), vLLM delivers roughly 16x more throughput under concurrent load. That's a future upgrade path, not a day-one requirement.

docker run --rm -it --gpus all --ipc=host --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "Qwen/Qwen3-Coder-Next-FP8" \
  --served-model-name qwen3-coder-next \
  --port 8000 \
  --max-model-len 170000 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --kv-cache-dtype fp8

Gotchas

  1. Registry names lie. devstral-small:24b is what guides reference. devstral is what Ollama's registry actually has. Always check ollama search or the Ollama website before assuming a model name.

  2. Cold starts kill scripted tests. Loading 23 GB into VRAM takes real time. If you're capturing output in a bash script with $(), the command can return before the model finishes loading. Manual ollama run works fine because it waits interactively.

  3. KEEP_ALIVE is essential. The default 5-minute unload timer means constant cold starts during normal coding. Set it to 30m or -1 (indefinite) via the systemd override. This is the single biggest quality-of-life improvement.

  4. 60 GB != 64 GB is normal. The kernel and firmware reserve memory. Your 64 GB kit will report ~60 GB usable. This is expected, not a hardware problem.

  5. Coder requires an API key even when the provider doesn't. Ollama doesn't authenticate requests, but Coder's provider config won't save without a key. Use any placeholder string. ollama works.

  6. Model identifiers must be exact. The string you enter in Coder's admin panel is sent verbatim to the /v1/chat/completions endpoint. If you type qwen3.5:35b-a3b but Ollama expects qwen3:35b-a3b, you'll get a model-not-found error. Run ollama list and copy the name exactly.
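Gotcha 6 is cheap to automate: compare the identifiers configured in Coder against what the endpoint actually reports. The comparison itself is pure; the commented wiring is a hypothetical sketch, with field names per the OpenAI model-list format:

```python
def missing_models(configured, available):
    """Return configured identifiers absent from the server's list."""
    have = set(available)
    return [m for m in configured if m not in have]

# Live list (hypothetical wiring):
#   GET http://127.0.0.1:11434/v1/models  ->  {"data": [{"id": "..."}]}
#   available = [m["id"] for m in response_json["data"]]
print(missing_models(["qwen3:35b-a3b", "devstral-small:24b"],
                     ["qwen3:35b-a3b", "devstral"]))
# -> ['devstral-small:24b']
```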

What's Next

The models are running locally and wired into Coder Agents. We have a fully self-hosted AI coding environment: Coder server, Ollama, and local inference on the same box, with cloud providers as a fallback.

The next step is benchmarking. How many tokens per second does qwen3.5:35b-a3b actually push on this hardware? Is the 256K context window usable in practice, or does performance degrade at long contexts? Does codestral:22b autocomplete feel instant in the IDE, or is there noticeable lag? And the real question: for which tasks do local models match cloud providers, and where do they fall short?

Numbers coming soon.

By the Numbers

  • 1 Ollama install (v0.21.0, single curl command)
  • 5 models pulled (4 generation + 1 embedding)
  • ~44 GB total model storage
  • 32,607 MiB VRAM available
  • 2 models configured in Coder Agents (Qwen 3.5B + Devstral)
  • 1 model name that was wrong (devstral-small:24b -> devstral)
  • 1 cold-start timing bug in the verification script
  • 15 minutes from script start to working local inference
  • 0 cloud API calls required
  • 0 data leaving the network
