# Run a Local AI Coding Agent for Free: Ollama + qwen2.5 Setup Guide
The cloud AI bill is real. If you're running code generation, refactoring, or doc generation at any scale, per-token costs add up fast. But here's the thing: a $600 desktop can run a 14B parameter model that handles 80% of your daily coding tasks — for free, forever.
This is a hands-on guide from a real deployment. I'm running qwen2.5:14b on a local Ubuntu box and routing it through Ollama as a drop-in replacement for cloud API calls.
## Why Ollama + qwen2.5?
Ollama turns running local LLMs into a two-command operation. It handles model downloads and GPU/CPU routing, and exposes an OpenAI-compatible REST API at `localhost:11434`.
qwen2.5 (from Alibaba's Qwen team) punches well above its weight class:
| Model | Size | Code quality | RAM needed |
|---|---|---|---|
| qwen2.5:7b | 4.7 GB | Strong | 8 GB |
| qwen2.5:14b | 9.0 GB | Excellent | 16 GB |
| qwen2.5:32b | 20 GB | Near-GPT4 | 32 GB |
For most coding tasks, 14b hits the sweet spot. It handles Python, Bash, JavaScript, Go, and Rust confidently.
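If you want to pick a tag mechanically, you can encode the RAM thresholds from the table above. A minimal sketch (`pick_model` is a hypothetical helper of mine, not part of Ollama):

```python
def pick_model(ram_gb: float) -> str:
    """Pick a qwen2.5 tag from available RAM, using the thresholds in the table."""
    if ram_gb >= 32:
        return "qwen2.5:32b"
    if ram_gb >= 16:
        return "qwen2.5:14b"
    return "qwen2.5:7b"

print(pick_model(16))  # qwen2.5:14b
```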
## Step 1: Install Ollama

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

That's it. Ollama installs as a system service. Verify:

```bash
ollama --version
# ollama version 0.5.x
```
## Step 2: Pull qwen2.5:14b

```bash
ollama pull qwen2.5:14b
```

This downloads ~9 GB. Go make coffee. Once done:

```bash
ollama list
# NAME          ID    SIZE    MODIFIED
# qwen2.5:14b   ...   9.0 GB  just now
```
## Step 3: Run It as a systemd Service

For persistent operation (survive reboots, auto-restart on crash):

```bash
sudo tee /etc/systemd/system/ollama.service > /dev/null << 'EOF'
[Unit]
Description=Ollama LLM Server
After=network.target

[Service]
Type=simple
User=YOUR_USER
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0:11434"

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
```

Verify it's running:

```bash
systemctl status ollama
# ● ollama.service - Ollama LLM Server
#   Active: active (running)
```
## Step 4: Test the API

```bash
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen2.5:14b",
    "prompt": "Write a Python function to parse a JWT token",
    "stream": false
  }' | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"
```
You should get clean, working Python code in under 5 seconds.
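If you leave streaming on instead (`"stream": true`, the API's default), `/api/generate` returns newline-delimited JSON objects, each with a `response` fragment and a `done` flag. Here is a sketch of reassembling them, with sample lines inlined so it runs without a server:

```python
import json

def collect_stream(lines):
    """Concatenate `response` fragments from Ollama's NDJSON stream."""
    out = []
    for line in lines:
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Sample of what the server emits, one JSON object per line:
sample = [
    '{"response": "def ", "done": false}',
    '{"response": "parse_jwt(token):", "done": false}',
    '{"response": "", "done": true}',
]
print(collect_stream(sample))  # def parse_jwt(token):
```

In a real client you would iterate over `response.iter_lines()` from `requests` instead of a list.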
## Step 5: Build a Python Coding Agent

Here's a minimal agent that reads a task description and generates code:

```python
import sys

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:14b"

SYSTEM_PROMPT = """You are a senior Python developer.
When given a task, respond with clean, working Python code only.
No explanations unless asked. No markdown fences."""


def run_coding_agent(task: str) -> str:
    payload = {
        "model": MODEL,
        "prompt": f"{SYSTEM_PROMPT}\n\nTask: {task}",
        "stream": False,
        "options": {
            "temperature": 0.2,  # Lower = more deterministic code
            "num_predict": 2048,
        },
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["response"]


def main():
    if len(sys.argv) < 2:
        print("Usage: python agent.py 'describe your task'")
        sys.exit(1)
    task = " ".join(sys.argv[1:])
    print(f"[agent] Task: {task}\n")
    print(run_coding_agent(task))


if __name__ == "__main__":
    main()
```

Usage:

```bash
python agent.py "write a CLI tool that monitors disk usage and alerts when over 80%"
```
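Even with "No markdown fences" in the system prompt, models sometimes wrap the output in ``` anyway. A defensive post-processing step you could bolt onto the agent (`strip_fences` is my own hypothetical helper):

```python
def strip_fences(text: str) -> str:
    """Remove a leading/trailing markdown code fence if the model added one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    return "\n".join(lines)

print(strip_fences("```python\nprint('hi')\n```"))  # print('hi')
```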
## Step 6: Use the OpenAI-Compatible Endpoint

Ollama speaks OpenAI's API format. That means any tool built for `api.openai.com` works with zero code changes:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but unused
)

response = client.chat.completions.create(
    model="qwen2.5:14b",
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Write a function to retry failed HTTP requests with exponential backoff"},
    ],
)
print(response.choices[0].message.content)
```

Drop-in replacement. Point your existing AI tooling at `localhost:11434/v1` and swap the model name.
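One way to keep the swap reversible is to read the endpoint from environment variables, so the same script can target either the cloud or the local box. A sketch under my own naming convention (`LLM_BASE_URL`, `LLM_API_KEY`, and `LLM_MODEL` are not standard variables):

```python
import os

def client_config() -> dict:
    """Resolve endpoint settings, defaulting to the local Ollama server."""
    return {
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),
        "model": os.environ.get("LLM_MODEL", "qwen2.5:14b"),
    }

cfg = client_config()
# client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
print(cfg["base_url"])
```

Set `LLM_BASE_URL=https://api.openai.com/v1` and a real key to flip back to the cloud without touching code.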
## Real-World Performance (From My Deployment)
Running on an Intel tower (Ubuntu 24.04, 32 GB RAM, no GPU):
| Task | qwen2.5:14b | GPT-4o |
|---|---|---|
| Simple function | ~3s | ~2s |
| 100-line refactor | ~12s | ~5s |
| API integration scaffold | ~18s | ~8s |
| Cost | $0 | $0.02–0.08 |
CPU-only is roughly 1.5–2.5x slower than cloud (per the table above), but $0 per call changes the math entirely. You can run 10,000 generations without watching a billing dashboard.
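The break-even arithmetic is simple: at the per-call range in the table, 10,000 GPT-4o-class generations cost $200–$800, while the hardware spend is a one-time $600. A quick check:

```python
calls = 10_000
low, high = 0.02, 0.08  # per-call cost range from the table above

print(f"${calls * low:,.0f} - ${calls * high:,.0f}")  # $200 - $800

hardware = 600  # the $600 desktop from the intro
print(round(hardware / high))  # calls to break even at the high end: 7500
```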
## Multi-Model Setup

Run multiple models simultaneously for different tasks:

```bash
ollama pull qwen2.5:7b    # Fast completions
ollama pull qwen2.5:14b   # Main coding tasks
ollama pull llama3.1:8b   # General Q&A
ollama pull mistral:7b    # Summarization
```
Route by task complexity:

```python
def get_model(task_type: str) -> str:
    routing = {
        "quick": "qwen2.5:7b",
        "code": "qwen2.5:14b",
        "general": "llama3.1:8b",
        "summary": "mistral:7b",
    }
    return routing.get(task_type, "qwen2.5:14b")
```
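To wire the router into the agent, swap the hard-coded `MODEL` for a per-request lookup. A standalone sketch (it redefines `get_model` so it runs on its own; `build_payload` is a hypothetical helper assembling the `/api/generate` request body):

```python
def get_model(task_type: str) -> str:
    routing = {
        "quick": "qwen2.5:7b",
        "code": "qwen2.5:14b",
        "general": "llama3.1:8b",
        "summary": "mistral:7b",
    }
    return routing.get(task_type, "qwen2.5:14b")

def build_payload(task_type: str, prompt: str) -> dict:
    """Assemble a request body for /api/generate, choosing the model per task."""
    return {"model": get_model(task_type), "prompt": prompt, "stream": False}

print(build_payload("summary", "Summarize this log")["model"])  # mistral:7b
print(build_payload("unknown", "...")["model"])                 # qwen2.5:14b
```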
## What This Unlocks

Once you have a local API at `localhost:11434`:
- Neovim/VS Code AI plugins — point any plugin that supports a custom OpenAI-compatible backend at it
- CI pipelines — generate PR summaries, test stubs, changelogs for free
- Personal automation — AI-powered scripts that run on your cron without cloud costs
- Fleet deployment — run one Ollama instance, serve the whole LAN
## Quickstart Checklist

- [ ] `curl -fsSL https://ollama.com/install.sh | sh`
- [ ] `ollama pull qwen2.5:14b`
- [ ] Configure the systemd service (auto-start on boot)
- [ ] Test: `curl localhost:11434/api/generate`
- [ ] Save the Python agent script above
- [ ] Point your OpenAI-compatible tools at `localhost:11434/v1`
**Hardware note:** 16 GB RAM minimum for 14b. If you're on 8 GB, use qwen2.5:7b — still excellent for most tasks. GPU is optional; CPU works fine for async/batch workloads.
Drop a comment if you hit any issues. Happy to share more of the setup — I'm running this as part of a larger autonomous agent stack.