DEV Community

SatStack

Run a Local AI Coding Agent for Free: Ollama + qwen2.5 Setup Guide

The cloud AI bill is real. If you're running code generation, refactoring, or doc generation at any scale, per-token costs add up fast. But here's the thing: a $600 desktop can run a 14B parameter model that handles 80% of your daily coding tasks — for free, forever.

This is a hands-on guide from a real deployment. I'm running qwen2.5:14b on a local Ubuntu box and routing it through Ollama as a drop-in replacement for cloud API calls.


Why Ollama + qwen2.5?

Ollama turns running local LLMs into a two-command operation. It handles model downloads and GPU/CPU routing, and exposes a REST API at localhost:11434 — including an OpenAI-compatible endpoint at /v1.

qwen2.5 (from Alibaba's Qwen team) punches well above its weight class:

| Model | Size | Code quality | RAM needed |
|---|---|---|---|
| qwen2.5:7b | 4.7 GB | Strong | 8 GB |
| qwen2.5:14b | 9.0 GB | Excellent | 16 GB |
| qwen2.5:32b | 20 GB | Near GPT-4 | 32 GB |

For most coding tasks, 14b hits the sweet spot. It handles Python, Bash, JavaScript, Go, and Rust confidently.
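If you're scripting the setup, the table above can be encoded as a small helper that picks the largest variant that fits in available memory. This is a sketch: the thresholds mirror the table, and `pick_model` is a name I'm inventing here, not anything Ollama provides.

```python
def pick_model(free_ram_gb: float) -> str:
    """Pick the largest qwen2.5 variant that fits in available RAM,
    using the rough requirements from the table above."""
    if free_ram_gb >= 32:
        return "qwen2.5:32b"
    if free_ram_gb >= 16:
        return "qwen2.5:14b"
    if free_ram_gb >= 8:
        return "qwen2.5:7b"
    raise ValueError("qwen2.5 needs roughly 8 GB of RAM at minimum")
```

For example, `pick_model(16)` returns `"qwen2.5:14b"`.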


Step 1: Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

That's it. Ollama installs as a system service. Verify:

ollama --version
# ollama version 0.5.x

Step 2: Pull qwen2.5:14b

ollama pull qwen2.5:14b

This downloads ~9 GB. Go make coffee. Once done:

ollama list
# NAME              ID              SIZE    MODIFIED
# qwen2.5:14b       ...             9.0 GB  just now

Step 3: Run It as a systemd Service

For persistent operation (survives reboots, auto-restarts on crash). Note that the Linux install script usually registers this service for you — create it manually only if it's missing or you want custom settings:

sudo tee /etc/systemd/system/ollama.service > /dev/null << 'EOF'
[Unit]
Description=Ollama LLM Server
After=network.target

[Service]
Type=simple
User=YOUR_USER
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=3
# 0.0.0.0 exposes the API to your whole network; use 127.0.0.1 to keep it local
Environment="OLLAMA_HOST=0.0.0.0:11434"

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama

Verify it's running:

systemctl status ollama
# ● ollama.service - Ollama LLM Server
#    Active: active (running)

Step 4: Test the API

curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen2.5:14b",
    "prompt": "Write a Python function to parse a JWT token",
    "stream": false
  }' | python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"

You should get clean, working Python code in under 5 seconds.
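The example above sets `"stream": false` to get one JSON blob back. With streaming (the API's default), Ollama instead returns one JSON object per line, each carrying a fragment of the answer, with `"done": true` on the last line. A minimal helper to reassemble a streamed response — a sketch assuming that newline-delimited JSON format:

```python
import json

def join_stream(ndjson_lines) -> str:
    """Reassemble a streamed /api/generate response.
    Each line is a JSON object with a partial "response" field;
    the final object carries "done": true."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

In practice you'd feed it `response.iter_lines()` from a `requests.post(..., stream=True)` call.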


Step 5: Build a Python Coding Agent

Here's a minimal agent that reads a task description and generates code:

import requests
import json
import sys

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:14b"

SYSTEM_PROMPT = """You are a senior Python developer. 
When given a task, respond with clean, working Python code only.
No explanations unless asked. No markdown fences."""

def run_coding_agent(task: str) -> str:
    payload = {
        "model": MODEL,
        "prompt": f"{SYSTEM_PROMPT}\n\nTask: {task}",
        "stream": False,
        "options": {
            "temperature": 0.2,      # Lower = more deterministic code
            "num_predict": 2048,
        }
    }

    response = requests.post(OLLAMA_URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["response"]

def main():
    if len(sys.argv) < 2:
        print("Usage: python agent.py 'describe your task'")
        sys.exit(1)

    task = " ".join(sys.argv[1:])
    print(f"[agent] Task: {task}\n")
    result = run_coding_agent(task)
    print(result)

if __name__ == "__main__":
    main()

Usage:

python agent.py "write a CLI tool that monitors disk usage and alerts when over 80%"
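One practical wrinkle: even with "No markdown fences" in the system prompt, models sometimes wrap their output in code fences anyway. A small guard before writing the result to a `.py` file — a hypothetical helper, not part of the agent above:

```python
def strip_fences(text: str) -> str:
    """Remove a leading ```python (or bare ```) line and a trailing ```
    line if the model wrapped its answer in a markdown code fence."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    return "\n".join(lines)
```

Call it on `result` in `main()` before printing or saving.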

Step 6: Use the OpenAI-Compatible Endpoint

Ollama speaks OpenAI's API format. That means any tool built for api.openai.com works with zero code changes:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="qwen2.5:14b",
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Write a function to retry failed HTTP requests with exponential backoff"}
    ]
)

print(response.choices[0].message.content)

Drop-in replacement. Point your existing AI tooling at localhost:11434/v1 and swap the model name.
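To keep the switch reversible, you can resolve the endpoint from environment variables so the same script targets Ollama locally or a cloud API without code edits. The variable names here (`LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL`) are my own convention, not anything Ollama or OpenAI defines:

```python
import os

def client_config() -> dict:
    """Resolve the API target from the environment, defaulting to the
    local Ollama endpoint. Pass base_url/api_key to OpenAI(...) and
    model to chat.completions.create(...)."""
    return {
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),  # Ollama ignores the key
        "model": os.environ.get("LLM_MODEL", "qwen2.5:14b"),
    }
```

Then `export LLM_BASE_URL=https://api.openai.com/v1` is all it takes to flip back to the cloud.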


Real-World Performance (From My Deployment)

Running on an Intel tower (Ubuntu 24.04, 32 GB RAM, no GPU):

| Task | qwen2.5:14b | GPT-4o |
|---|---|---|
| Simple function | ~3s | ~2s |
| 100-line refactor | ~12s | ~5s |
| API integration scaffold | ~18s | ~8s |
| Cost | $0 | $0.02–0.08 |

CPU-only is ~2-3x slower than cloud, but $0 per call changes the math entirely. You can run 10,000 generations without watching a billing dashboard.


Multi-Model Setup

Run multiple models simultaneously for different tasks:

ollama pull qwen2.5:7b    # Fast completions
ollama pull qwen2.5:14b   # Main coding tasks  
ollama pull llama3.1:8b   # General Q&A
ollama pull mistral:7b    # Summarization

Route by task complexity:

def get_model(task_type: str) -> str:
    routing = {
        "quick": "qwen2.5:7b",
        "code": "qwen2.5:14b", 
        "general": "llama3.1:8b",
        "summary": "mistral:7b"
    }
    return routing.get(task_type, "qwen2.5:14b")
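Extending that router, a payload builder keeps the model choice out of your call sites. This is a self-contained sketch — it inlines the routing table rather than importing `get_model`:

```python
DEFAULT_MODEL = "qwen2.5:14b"

ROUTING = {
    "quick": "qwen2.5:7b",
    "code": "qwen2.5:14b",
    "general": "llama3.1:8b",
    "summary": "mistral:7b",
}

def build_payload(task_type: str, prompt: str) -> dict:
    """Build a /api/generate request body for whichever model
    the routing table maps this task type to."""
    return {
        "model": ROUTING.get(task_type, DEFAULT_MODEL),
        "prompt": prompt,
        "stream": False,
    }
```

Usage: `requests.post(OLLAMA_URL, json=build_payload("summary", text))`.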

What This Unlocks

Once you have a local API at localhost:11434:

  • Neovim/VS Code AI plugins — point any plugin that supports an OpenAI-compatible endpoint at it
  • CI pipelines — generate PR summaries, test stubs, changelogs for free
  • Personal automation — AI-powered scripts that run on your cron without cloud costs
  • Fleet deployment — run one Ollama instance, serve the whole LAN

Quickstart Checklist

  • [ ] curl -fsSL https://ollama.com/install.sh | sh
  • [ ] ollama pull qwen2.5:14b
  • [ ] Configure systemd service (auto-start on boot)
  • [ ] Test: curl localhost:11434/api/generate
  • [ ] Save the Python agent script above
  • [ ] Point your OpenAI-compatible tools at localhost:11434/v1

Hardware note: 16 GB RAM minimum for 14b. If you're on 8 GB, use qwen2.5:7b — still excellent for most tasks. GPU is optional; CPU works fine for async/batch workloads.
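If the box doubles as a workstation, two Ollama environment variables worth knowing are `OLLAMA_KEEP_ALIVE` (how long a model stays loaded after the last request) and `OLLAMA_NUM_PARALLEL` (concurrent requests per loaded model). A systemd drop-in is the clean way to set them — the values below are illustrative, not recommendations:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
# Keep the model in RAM for 30 minutes after the last request
Environment="OLLAMA_KEEP_ALIVE=30m"
# Serve up to 2 requests to the same model in parallel
Environment="OLLAMA_NUM_PARALLEL=2"
```

Apply with `sudo systemctl daemon-reload && sudo systemctl restart ollama`.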

Drop a comment if you hit any issues. Happy to share more of the setup — I'm running this as part of a larger autonomous agent stack.
