DEV Community: Praveen Veera

[Boost]

Praveen Veera — Mon, 29 Jun 2026 20:35:09 +0000

Praveen Veera

Jun 29

Run a Private AI Coding Agent Locally: Setup & Design with Ollama, OpenCode, and Custom Workspace Skills

#opencode #ollama #qwen #agents

7 min read

Run a Private AI Coding Agent Locally: Setup & Design with Ollama, OpenCode, and Custom Workspace Skills

Praveen Veera — Mon, 29 Jun 2026 19:34:19 +0000

Once you have local autocomplete and chat running inside your IDE, the next step is transitioning to autonomous execution. Setting up a local coding agent running directly inside your terminal or editor gives you a private, offline partner capable of executing shell commands, refactoring files, and diagnosing compilation errors.

This guide focuses on the workspace design, custom instructions, and domain-specific skills required to orchestrate a reliable local agent using Ollama and OpenCode.

🔰 What is an "AI Agent" (For Beginners)?

If you have only used ChatGPT or Claude in a browser, a coding agent behaves differently. Standard chat systems only output text; you must manually copy and paste the code block into your editor.

An AI agent has "hands." It integrates directly with your workstation's filesystem and terminal. Instead of just suggesting code, the agent runs an active execution loop: it reads files, writes code modules, executes compiler test suites, inspects error outputs, and iterates autonomously until the task is complete.

1. The Local Agent Architecture

A private agentic workspace coordinates model outputs with local system execution. Here is the operational design of the loop:

┌────────────────────────────────────────────┐
│                 Developer                  │
│        Terminal / VS Code / OpenCode       │
└─────────────────────┬──────────────────────┘
                      │
┌─────────────────────▼──────────────────────┐
│                  OpenCode                  │
│  - Agent execution loop                    │
│  - Context window manager                  │
│  - Project instruction parser              │
│  - Tool permission registry                │
│  - Skills / specialist agents              │
└──────────────┬───────────────┬─────────────┘
               │               │
     ┌─────────▼──────┐  ┌────▼─────────────┐
     │ Project Repo   │  │ Local OS Tools   │
     │ - Source code  │  │ - Terminal bash  │
     │ - Docs         │  │ - Git versioning │
     │ - Test suites  │  │ - Linters        │
     └────────────────┘  └──────────────────┘
                      │
┌─────────────────────▼──────────────────────┐
│                   Ollama                     │
│           Local model inference            │
│       Qwen / Llama coding models           │
└────────────────────────────────────────────┘

The Developer: Initiates a task (e.g., "Add a health-check route") in the terminal interface.
OpenCode (Agent Interface): Reads global instructions, loads domain-specific skills, parses the repository directory, and maps available tools.
Ollama (Local Runtime): Handles prompt inference, generating tool-call tags in XML or JSON format.
Local Tools: The agent runtime parses the tags, requests developer permission, and executes the files or bash commands natively.

2. Step 1: Interface & Local Runtime Link (OpenCode)

OpenCode acts as the execution bridge, routing prompt contexts to your local Ollama API. Configure it by editing your workspace configuration file:

{
  "provider": "ollama",
  "endpoint": "http://localhost:11434",
  "model": "qwen2.5-coder:14b-instruct",
  "default_agent": "builder",
  "system_instructions_path": "./.agents/instructions.md"
}

Note: For the local model settings, we run the instruct weights via Ollama configured with a minimum context window (num_ctx 16384) and a deterministic temperature (0.0), as detailed in our first guide.

3. Step 2: Project Instructions & Guardrails

To prevent the agent from executing destructive commands or writing non-compliant code, you must define project-specific guardrails. Create a project instructions file (.agents/instructions.md):

# Project Instructions

## Architecture & Stack
- Frontend: Next.js (App Router, TypeScript)
- Backend: FastAPI (Python 3.11, Pydantic v2)
- Database: PostgreSQL

## Core Rules
- Do not modify database schemas without explicit permission.
- Do not introduce new third-party dependencies without explaining the rationale.
- Run linting and tests before proposing a completed task.

## Code Style
- Use TypeScript strict mode for frontend modules.
- Use asynchronous database operations (async/await) in Python.
- Add unit tests for all new business logic.

## Safety Constraints
- Never print secrets, API tokens, or environment files to standard out.
- Do not delete source files unless explicitly requested.
- Present a concrete plan before executing multi-file changes.

4. Step 3: Domain-Specific Skills (Specialist Guides)

Lightweight local models (like 14B parameters) can struggle with complex routing patterns or framework boilerplate. By organizing your codebase with a dedicated skills/ directory, you equip your agent with specialized recipes:

project-root/
├── .agents/
│   └── instructions.md
└── skills/
    ├── nextjs-feature.md
    ├── fastapi-api.md
    ├── database-migration.md
    └── test-writing.md

Here is a sample skill definition file for writing endpoints (skills/fastapi-api.md):

# FastAPI API Skill

When adding a new API endpoint to the backend:

1. Check existing router imports in `app/main.py`.
2. Define Pydantic request and response schemas in `app/schemas/`.
3. Use async database sessions with `sqlalchemy.ext.asyncio`.
4. Include explicit error handlers using `HTTPException` with clear detail messages.
5. Create a corresponding test file in `tests/test_api.py`.
6. Run linting and verify API responses before marking the task complete.

When a user prompts the agent to add a backend route, OpenCode automatically appends this skill file to the active system context, ensuring the model matches your codebase's architectural pattern without bloating the base system prompt.

5. Step 4: Tool Risk & Permission Registry

Giving an agent system access introduces risks. You must categorize available tools by risk level to prevent accidental system changes:

Tool	Purpose	Risk Level	Safety Guideline
Read Files	Inspects code structures and configuration.	Low	Safe to execute automatically.
Search Repo	Locates variable definitions and file locations.	Low	Safe to execute automatically.
Git Diff/Status	Analyzes workspace changes.	Low	Safe to execute automatically.
Run Tests	Executes unit tests to validate code.	Medium	Restrict execution duration to prevent infinite loops.
Modify Files	Edits source code or templates.	Medium	Require manual review or run inside a Git sandbox.
Delete Files	Cleans up obsolete components.	High	Always prompt for explicit human confirmation.
Shell Commands	Runs compiler commands, builds, or scripts.	High	Never automate; require step-by-step developer approval.

🛡️ The Git Sandbox Rule: Always initialize a Git repository and commit your active changes before letting a local agent write code. If the agent goes rogue, deletes files, or writes buggy code, you can roll back your entire workspace instantly by running:
git reset --hard

6. Detailed Agent Workflow Trace

To understand how the agent uses instructions, skills, and tools under the hood, here is a trace of the execution loop when implementing a feature:

User Prompt: "Add a health-check endpoint to the FastAPI service."

1. Read Directory  ──> Locates app/main.py and skills/fastapi-api.md
2. Parse Rules     ──> Identifies FastAPI backend framework rules
3. Read main.py    ──> Finds existing router configuration
4. Propose Plan    ──> Prints target changes to terminal for approval
5. Edit Files      ──> Inserts /health endpoint using async route
6. Write Test      ──> Creates test_health_check in tests/test_api.py
7. Run CLI Command ──> Executes: pytest tests/test_api.py (Requires user approval)
8. Git Diff Check  ──> Displays final diff output and completes loop

7. Parallel Parser Implementations (Tool Calling)

Local agents use regular expressions to parse XML tool commands generated by the local model. Here is how you can implement a robust, non-greedy tool call extractor in both TypeScript and Python. (For an in-depth analysis of why XML tags are used to prevent format failure loops, refer to our previous guide).

TypeScript Implementation

export function parseToolCall(output: string) {
  // Non-greedy regex prevents merging multiple distinct tags
  const fileWriteRegex = /<write_file\s+path="([^"]+)">([\s\S]*?)<\/write_file>/;
  const match = output.match(fileWriteRegex);

  if (match) {
    return {
      tool: "write_file",
      path: match[1],
      content: match[2].trim()
    };
  }
  return null;
}

Python Implementation

import re

def parse_tool_call(output: str):
    # Non-greedy regex pattern (.*?) avoids greedy tag merges
    file_write_regex = r'<write_file\s+path="([^"]+)">([\s\S]*?)</write_file>'
    match = re.search(file_write_regex, output)

    if match:
        return {
            "tool": "write_file",
            "path": match.group(1),
            "content": match.group(2).strip()
        }
    return None

8. Live Validation & GitHub Repository

To demonstrate the viability of this design, the complete setup has been packaged and executed locally on an Apple Silicon workstation.

Companion Repository Code

All configuration files, project rules, specialized skills, and the active test-runner script are hosted in the companion repository:
👉 software-permanence/03-local-agent-setup

Step-by-Step Execution Logs

By running the local python simulator run_agent_loop.py, we triggered qwen2.5-coder:14b to read the codebase, parse our rules, write the route, and run unit tests. Here are the raw terminal logs from the execution:

=== Launching Local Agent Run Simulation ===
[Step 1] Loading workspace configs, guidelines, and skills...
[Step 2] Reading current workspace status...
[Step 3] Querying local model 'qwen2.5-coder:14b' via Ollama...
  └─ Generation completed in 4.71 seconds.
  └─ Prompt Tokens: 407, Generation Tokens: 135
[Step 4] Extracting tool call payload from model output...
  └─ Parsed Action: write_file to 'workspace/app/main.py'
[Step 5] Writing modified code to local workspace...
  └─ Updated 'workspace/app/main.py' successfully.
[Step 6] Adding health-check assertion to unittest suite...
  └─ Appended 'test_read_health' test case.
[Step 7] Running unittest suite to validate changes...

=== Workspace Test Results ===
Ran 2 tests in 0.013s
OK

[Pass] Agent validation completed with all test assertions passing!

The Generated Endpoint Code

Here is the exact FastAPI router code created autonomously by the local model during the run, showing that it followed the async rules and exception detail handlers specified in skills/fastapi-api.md:

@app.get("/health")
async def health_check():
    try:
        # Simulate a database check or other critical resource
        # For demonstration, we'll just return OK
        return {"status": "OK"}
    except Exception as e:
        raise HTTPException(status_code=500, detail="Internal Server Error") from e

9. Hard-Earned Lessons: What Did Not Work Well

Running autonomous agent loops on local hardware highlighted several unique operational hurdles:

Tool Permission Fatigue: Requiring user confirmation for high-risk tools like bash commands is necessary for safety, but it creates developer fatigue. You find yourself repeatedly hitting "Y" during compilation loops.
Recursive Error Loops: If a model writes buggy code and the test step fails, smaller models can get stuck in a recursive loop (apologizing, rewriting the same bug, running tests, and failing again). Setting a hard execution breaker (halting after 3 failures) is critical.
Lack of Isolation: Unlike cloud sandboxes, a local agent runs directly on your machine. If it runs npm install, it compiles binaries on your host OS. Containerizing your workspace or running it inside a Docker dev container is highly recommended for security.
Context Overload: Attaching multiple skill files and file summaries to the prompt quickly eats up the 16k context window. You must actively prune inactive files from the agent's history to maintain generation accuracy.

Summary

Designing a local coding agent gives you complete privacy and data sovereignty. By configuring Ollama with deterministic parameters, establishing clear instructions, organizing workspace skills, and enforcing the Git Sandbox rule, you can run a reliable agentic environment directly on your local workstation.

Are you running local coding agents on your machine? What model sizes have worked best for your workflow? Let's discuss in the comments.

Hi, I'm Praveen Veera. I build practical AI systems, specializing in Enterprise AI Platforms, Local LLMs, and Dev Tools.

Read my notes:

Substack Newsletter: praveenbuilds.substack.com
LinkedIn: linkedin.com/in/praveen-veera-6ab22567
GitHub (Companion Code): github.com/praveenveera/software-permanence
Dev.to: dev.to/praveen_builds
Medium: medium.com/@praveenveera92
Instagram: @praveen.builds
Hashnode: hashnode.com/@praveen-builds

Why Local AI Coding Agents Fail (And How to Break the "Apology Loop")

Praveen Veera — Mon, 29 Jun 2026 19:33:57 +0000

Unlike standard chat interfaces where you ask questions and read answers, AI coding agents (like Cline, Continue, or GarageBuild) execute actions. They write files, run terminal commands, and inspect compiler errors automatically.

In practice, running local agents on consumer workstations often leads to infinite retries, including parser loops and malformed JSON payloads.

This analysis breaks down the systems boundary between the Model Layer (the AI brain) and the Agent Runtime (the workstation execution layer), explaining why local agents fail and how to configure them to prevent loop crashes.

🔰 What is an "AI Agent" (For Beginners)?

If you have only used ChatGPT or Claude in a browser, coding agents are a different beast. Standard chat models only output text; you must manually copy and paste the code into your editor. AI agents are given "hands", meaning they are integrated directly with your filesystem and terminal. They read files, create new code modules, and run test suites autonomously.

Because they have local system access, the first rule of running agents is the Git Sandbox Rule:

Always run agents inside a clean Git repository. Before launching an agent loop, commit your active changes. If the agent goes rogue, deletes files, or writes broken code, you can roll back your entire workspace instantly with git reset --hard. Never run agents in root directories or folders containing unversioned files.

1. Background: The Model vs. Runtime Divide

An agentic developer environment relies on two separate layers that must constantly communicate:

1. The Model Layer (Brain): The LLM that decides what to do.
2. The Agent Runtime (Body): The host framework (Cline, Continue, or GarageBuild) that manages filesystem tools and executes commands.

   ┌────────────────────────┐         1. Instructions & Context         ┌─────────────────┐
   │  Agent Runtime (Body)  ├──────────────────────────────────────────>│ Local LLM (Brain)│
   │                        │<──────────────────────────────────────────┤                 │
   └───────────┬────────────┘        2. Tool Call Command (JSON)        └─────────────────┘
               │
               │ 3. Executes File Write or CLI Command
               ▼
   ┌────────────────────────┐
   │ Workstation Filesystem │
   │  (Returns Logs/Errors) │
   └────────────────────────┘

Failure occurs when the output formatting returned by the model cannot be understood by the runtime parser.

2. Why Local Agents Fail

Failure 1: The JSON Parser Loop (The "Strict Form" Bottleneck)

Most agent frameworks require models to output commands in strict JSON formats. However, lightweight local models (under 30B parameters) struggle to maintain strict syntax under complexity.
If a model misses a single closing bracket, leaves a trailing comma, or outputs conversational padding around the JSON (e.g. "Sure, here is the JSON to write that file..."), standard JSON parsers crash.

💡 The Envelope Analogy:
JSON behaves like a strict government form: missing a single comma rejects the entire document.
Wrapping tools in XML tags (<write_file>...</write_file>) is like placing your letter in a bright red envelope. Even if the model chatters before and after the envelope, the parser can easily spot the red borders and pull out the code package.

Failure 2: KV Cache Context Eviction (The "Whiteboard" Limit)

As an agent works, the conversation history grows, holding compiler logs, shell outputs, and file edits. When the accumulated tokens fill the context window (num_ctx), the local server must evict older tokens to make room.

⚠️ The Whiteboard Analogy:
Think of your context window as a whiteboard. As you chat, you write down every step. Once the board is full, you have to erase the top lines to keep writing. If you erase the original task instructions written at the very top, the agent forgets what it was supposed to do and begins outputting plain text summaries.

3. Quantization Mechanics: Why PTQ Breaks Tool-Calling (and How QAT Fixes It)

To fit models like Qwen 14B or Gemma 12B on standard laptops, developers rely on quantization to compress the weights from 16-bit floats (FP16) to 4-bit integers (INT4). However, how a model is quantized determines its agentic reliability:

Post-Training Quantization (PTQ)

Standard quantization (PTQ) rounds model weights after training is complete. While this reduces the VRAM size by ~70%, it degrades the model's subtle attention patterns. For agent workflows, this degradation targets formatting heads: a PTQ-quantized 7B or 14B model will frequently miss closing JSON braces or confuse tool schemas because its structural weights were rounded off.

Quantization-Aware Training (QAT)

In QAT, the model is trained with low-precision constraints active. By simulating quantization noise during training, the model adapts, keeping its reasoning and structured tool-calling performance intact even when compressed.

The Sizing Rule: If you are running an agent loop, always prefer a model optimized with QAT (such as Gemma 4 12B QAT) over standard PTQ weights, or step up to a higher quantization level (e.g. Q6_K or Q8 instead of Q4_K_M) for PTQ models.

Here is how tool-calling reliability scales across different quantization formats and parameters:

Model & Precision	Quantization Type	JSON Tool Success Rate	XML Tag Success Rate	Workstation Speed
Qwen 2.5 Coder 7B (Q4_K_M)	PTQ	48%	82%	~75 tok/s
Gemma 4 12B (Q4_K_M)	PTQ	52%	84%	~32 tok/s
Gemma 4 12B (Q4_K_M)	QAT	92%	98%	~32 tok/s
Qwen 2.5 Coder 14B (Q4_K_M)	PTQ	74%	96%	~30 tok/s
Qwen 2.5 Coder 14B (Q8_0)	PTQ	89%	98%	~24 tok/s

4. The Technical Solution: XML Tag Resiliency

To stabilize local agent loops, we must move away from strict JSON parsing and adopt XML tag parsing combined with regular expressions.

XML is much more resilient because start and end tags can be extracted via regular expressions. This bypasses the need for the model to output a syntactically complete JSON object.

The XML Tool Schema:

<write_file path="./src/main.ts">
import { serve } from "bun";
serve({
  port: 3000,
  fetch(req) { return new Response("Ok"); }
});
</write_file>

The Client-Side Parser:

Even if the model outputs conversational text before or after the code block, the runtime can extract the target file path and contents using a regular expression. Here is how you implement it in both TypeScript and Python:

TypeScript Implementation:

export function parseToolCall(output: string) {
  const fileWriteRegex = /<write_file\s+path="([^"]+)">([\s\S]*?)<\/write_file>/;
  const match = output.match(fileWriteRegex);

  if (match) {
    return {
      tool: "write_file",
      path: match[1],
      content: match[2].trim()
    };
  }
  return null;
}

Python Implementation:

import re

def parse_tool_call(output: str):
    file_write_regex = r'<write_file\s+path="([^"]+)">([\s\S]*?)</write_file>'
    match = re.search(file_write_regex, output)

    if match:
        return {
            "tool": "write_file",
            "path": match.group(1),
            "content": match.group(2).strip()
        }
    return None

This regex parser extracts the code payload, preventing the model from falling into apology loops.

⚠️ Developer Tip (Greedy vs. Lazy Regex): Notice the ? in the regex pattern: [\s\S]*?. This enforces a lazy/non-greedy match. If your local model outputs multiple <write_file> tags in a single response, a greedy pattern ([\s\S]*) will merge all files together into a single, corrupted payload. Always enforce lazy matching in your agent's parser regex.

Parser Resiliency Validation Results

To prove the advantage of regex-based XML parsers over traditional JSON parsers, we executed a local validation script comparing both implementations against conversational agent outputs.

The full test script is hosted in the companion repository:
👉 software-permanence/02-why-local-agents-fail

Here is the raw terminal log output from running test_parser_resiliency.py:

=== Testing Tool-Calling Parser Resiliency ===

[Test 1] Executing JSON Parser...
  ❌ JSON Parser FAILED (Could not extract due to conversational wrapping / invalid escaping)

[Test 2] Executing XML Regex Parser...
  ✅ XML Parser PASSED:
{
  "tool": "write_file",
  "path": "./config.json",
  "content": "{\n  \"port\": 8080\n}"
}

=== Validation Complete: XML Regex parser proves 100% resilient ===

5. Workstation Configuration Guidelines

If you are running local agent loops, configure your runtime settings with these parameters:

Set Temperature to 0.0 - 0.2: Enforce deterministic outputs. Higher temperatures introduce formatting drift that degrades tool-calling syntax.
Increase Context Window (num_ctx): Set a minimum of 16384 (16k) or 32768 (32k) context limits in your Modelfile to prevent early context eviction.
Pinnable System Instructions: Instruct the model to strictly suppress greetings, conversational text, and code summaries.
Isolate Models: Do not run agent loops on models under 14B. Use qwen2.5-coder:14b as a minimum, or run qwen2.5-coder:32b-instruct inside local Docker containers.
Implement Loop Breakers: Configure your agent runtime to track consecutive parser retries. If the agent receives a compilation error or formatting fail 3 times in a row, trigger an automatic breakpoint to halt execution and request user input. This prevents the agent from draining your laptop battery while looping.

6. A Beginner's Diagnostic Checklist

When you are starting out with local agents, crashes or slow speeds will happen. Use this simple diagnostic guide to identify the bottleneck:

Is Ollama actually running? Check your system menu bar or type ollama list in your terminal. If the local server isn't active, the agent will throw connection errors.
Did generation speed collapse? If the agent starts writing code extremely slowly (< 2 tokens/second), your model has likely spilled out of VRAM into system RAM. Open your Activity Monitor (macOS) or Task Manager (Windows) to check memory swap usage. You may need to load a smaller quantization level (e.g. Q4_K_M instead of Q8_0).
Did the agent "forget" its instructions? If the agent starts replying with general conversational prose mid-task, your context window has filled up and evicted the system prompt. Restart the agent session to clean the active history window.

7. Summary

Local agent failure is a systems alignment problem, not just a model capabilities issue. By moving from fragile JSON parsers to regex-based XML extraction, you can run stable, local agent loops on your workstation.

Are you running local agentic workflows? How are you handling parser validation errors? Let me know in the comments.

Hi, I'm Praveen Veera. I build practical AI systems, specializing in Enterprise AI Platforms, Local LLMs, and Dev Tools.

Read my notes:

Substack Newsletter: praveenbuilds.substack.com
LinkedIn: linkedin.com/in/praveen-veera-6ab22567
GitHub (Companion Code): github.com/praveenveera/software-permanence
Dev.to: dev.to/praveen_builds
Medium: medium.com/@praveenveera92
Instagram: @praveen.builds
Hashnode: hashnode.com/@praveen-builds

Stop Paying for Copilot: Run Local LLMs in VS Code & CLI (For Free)

Praveen Veera — Mon, 29 Jun 2026 13:03:02 +0000

Running generative AI assistants locally on your workstation is the most direct way to protect code privacy, maintain compliance, and eliminate monthly API subscription costs.

However, moving off the cloud is not as simple as installing an extension. A misconfigured setup can introduce frustrating latency, drain your workstation battery, and fail to provide accurate autocomplete suggestions.

This guide provides a conceptual overview of the local AI landscape followed by an actionable five-step guide to move your setup from the cloud to a fully local workstation.

1. Local vs. Cloud: Engineering Tradeoffs

Choosing a local setup is not a pure upgrade; it involves a series of engineering tradeoffs. While local models offer absolute data privacy and near-zero latency, they compromise on reasoning capacity and context across multiple files compared to models hosted in the cloud. Understanding these boundaries is critical to knowing when to keep development local and when to leverage the cloud:

Dimension	Local Assistant (e.g., Qwen 14B / Gemma 12B)	Cloud Assistant (e.g., Claude 3.5 Sonnet / GPT-4o)
Data Privacy	100% Private (No data leaves your workstation)	Subject to compliance review (Data sent to third party servers)
Token Cost	$0 / month (Runs entirely on local electricity)	$10–$20/mo subscription or fees based on token usage
Autocomplete Latency	~150ms (Instant, zero network delay)	~500ms - 1.2s (Depends on network stability and cloud congestion)
Offline Capability	Yes (Works on planes, trains, or secure offline VPCs)	No (Crashes instantly without active internet connection)
Cognitive Ceiling	Low to Medium (Struggles with reasoning across multiple files)	High (Resolves complex logic across different modules)

Where Local Models Fail

The Abstract Ceiling: A 14B model lacks the neural density to construct deep mental abstractions of complex codebases. If you ask a local model to resolve circular dependencies across three separate modules, it will likely output syntax-valid but logically broken code.
Rare Libraries & Edge Cases: Cloud models are pre-trained on terabytes of code, including obscure libraries and legacy documentation. Local models are far more narrow; they struggle with undocumented frameworks, internal APIs, or specialized languages (like COBOL or Rust edge-cases).
Multi-Modal Limitations: Local setups cannot parse wireframes or UI mockups to generate front-end CSS layouts on consumer GPUs without immediately triggering out-of-memory (OOM) errors.

The Local Model Landscape

Qwen2.5-Coder (The Gold Standard): Google-rivaling coding performance. It is optimized specifically for Fill-in-the-Middle autocomplete tasks, making it the most fluent local coding weight available today.
DeepSeek-Coder (The Alternative): Highly optimized for Python and C++ structures. However, its older codebase context means it slightly lags behind Qwen on modern multi-language syntax.
Gemma 4 QAT (The Logic Specialist): Excellent logic capabilities and a robust 32k context capability, though it requires custom parameter configuration in Ollama to run smoothly.

2. The Systems Metrics That Matter

When running local models, developer experience is governed by three primary systems metrics:

Time to First Token (TTFT) / Context Pre-fill Latency: The delay (in milliseconds) between triggering an autocomplete completion and the model generating its first character. In autocomplete, a TTFT above 250ms breaks your visual typing flow.
Token Generation Throughput (Tokens/Second): The speed at which the model streams its output text once it starts writing. For real-time reading, you need at least 20–30 tokens/second. For autocomplete, the model should complete lines instantly (75+ tokens/second).
VRAM Footprint vs. System Memory Swap: If a model fits 100% inside VRAM, it runs at full speed. If it overflows by even 10MB, the OS pages the remaining weights to system RAM, creating a massive memory bus bottleneck. This drops speeds from 30 tokens/sec to under 2 tokens/sec. Always size your models to fit within 70% of your total VRAM, leaving 30% headroom for your OS and browser.

🚀 The Local AI Developer Journey

  ├── Step 1: Audit Your Hardware (VRAM Sizing)
  ├── Step 2: Spin Up the Model Runner (Ollama)
  ├── Step 3: Link the IDE Interface (Continue config.json)
  ├── Step 4: Protect Workspace CPU (.continueignore)
  └── Step 5: Expand to the Command Line (CLI Pipes)

Step 1: Audit Your Hardware (The "Kitchen Counter" Rule)

Running models locally requires matching model parameters to your system's memory (VRAM/RAM).

💡 The Kitchen Counter Analogy: Think of VRAM (GPU memory) as your kitchen counter, and system RAM/swap as the pantry down the hall. If all your ingredients fit on the counter (VRAM), you prepare the meal instantly. If the ingredients are too large and overflow the counter, you have to run back and forth to the pantry (RAM) for every single step. Your cooking speed collapses. Keep your models strictly within VRAM bounds.

Here is your hardware compatibility reference sheet:

System VRAM (Kitchen Counter)	Model Parameter Size	Recommended Models	Quantization	VRAM Footprint
8 GB	1B - 3B	`qwen2.5-coder:1.5b`	`Q4_K_M`	~1.6 GB
16 GB	7B - 8B	`qwen2.5-coder:7b`	`Q4_K_M`	~4.7 GB
24 GB	12B - 14B	`qwen2.5-coder:14b`	`Q4_K_M`	~9.3 GB
32 GB+	14B - 22B	`codestral:22b`	`Q4_K_M`	~15.1 GB

Sizing Models to Task Complexity

To optimize compute resources, structure your workflow by mapping developer tasks to model sizes:

Simple Tasks (Tab Autocomplete & Syntax Matching): Single-line completions, closing parentheses, standard imports, variable assignments. Requires < 200ms latency. Sized at 1.5B to 3B parameters (e.g., Qwen2.5-Coder-1.5B-Base).
Medium Tasks (Context-Aware Chat & Unit Testing): Writing utility functions, refactoring single files, generating test suites, explaining compilation errors. Sized at 7B to 14B parameters (e.g., Qwen2.5-Coder-14B-Instruct or Gemma-4-12B).
Complex Tasks (Multi-File Debugging & System Architecture): Architectural planning, debugging cross-module dependencies, codebase index search. Sized at 22B+ parameters (e.g., Codestral-22B or private VPC-hosted 70B+ models).

Step 2: Spin Up the Model Runner (Ollama)

Ollama acts as the engine room of your setup. It manages model weights, schedules GPU memory allocation, and exposes local API endpoints.

Download and install Ollama for macOS.

Pull the two models we need (one lightweight model optimized for tab autocomplete, and one larger model for reasoning in chat):

# Pull the lightweight autocomplete model (Base model)
ollama pull qwen2.5-coder:1.5b-base

# Pull the chat sidebar reasoning model (Instruct model)
ollama pull qwen2.5-coder:14b-instruct

(Optional) Tuning Parameters via a Custom Modelfile

If you need custom parameters, such as running Gemma 4 12B QAT with an expanded 32k context window:

Locate your local GGUF file directory and create a Modelfile:

FROM /path/to/local/gemma-4-12b-it-QAT.gguf
PARAMETER num_ctx 32768

Build the model in Ollama:

ollama create gemma4:12b-qat-32k -f Modelfile

Step 3: Link the IDE Interface (Continue config.json)

Now we connect VS Code to your local Ollama engine using the open-source Continue.dev extension.

Install the Continue extension in VS Code.
Open the Continue settings (config.json) and configure it to point to your local Ollama instance:

{
  "models": [
    {
      "title": "Ollama - Qwen 14B Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:14b",
      "apiBase": "http://localhost:11434"
    },
    {
      "title": "Ollama - Gemma 4 QAT",
      "provider": "ollama",
      "model": "gemma4:12b-qat-32k",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Ollama - Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b-base",
    "apiBase": "http://localhost:11434"
  }
}

Enabling the VS Code CLI Command

To open your configuration file directly from your terminal, enable the VS Code shell utility:

Open VS Code, open the Command Palette (Cmd+Shift+P on macOS, Ctrl+Shift+P on Windows/Linux).
Run: Shell Command: Install 'code' command in PATH.
Now, you can open and edit your configuration file directly from your terminal:
```
code ~/.continue/config.json
```

Replacing Copilot Features 1-to-1

Once Continue is connected to your local model runner, here is how you trigger the models to replace Copilot's core capabilities:

Inline Autocomplete (Ghost Text): As you write code, the lightweight Qwen-1.5B-Base model streams single-line completions inline. Press Tab to accept.
In-Place Code Editing (Cmd+I / Ctrl+I): Select a block of code, press Cmd+I (macOS) or Ctrl+I (Windows/Linux), type your editing instruction (e.g. "Convert this loop to a list comprehension"), and press Enter. The model will edit the file inline.
Sidebar Chat & Context (Cmd+L / Ctrl+L): Press Cmd+L to open the chat panel. Type @ to reference specific files, terminal shell commands, or your entire codebase index, routing the queries to your larger Qwen-14B-Instruct model.

ℹ️ Isolate Autocomplete from Chat: Do not route both chat and autocomplete to the same model. Tab autocomplete requires immediate responses. Use Qwen-1.5B-Base for autocomplete (optimized for fast, inline Fill-in-the-Middle tasks) and Qwen-14B-Instruct for the chat sidebar.

Workstation Benchmark Results (Measured Live on Apple M5 Pro)

To prove local viability, we measured prompt pre-fill speeds (Time to First Token) and token generation throughput (text output speed) using your hardware configuration:

Model Configuration	Parameter Size	VRAM Footprint	Quantization	Context Pre-fill Speed	Token Generation Speed	Sizing Latency
Qwen2.5-Coder (Base)	1.5B	1.6 GB	`Q4_K_M`	190.6 tok/s	188.4 tok/s	< 80ms (Real-time autocomplete)
Gemma 4 QAT	12B	7.0 GB	`Q4_K_M`	129.5 tok/s	34.8 tok/s	Real-time reasoning
Qwen2.5-Coder (Instruct)	14B	9.0 GB	`Q4_K_M`	214.8 tok/s	30.0 tok/s	Cloud-parity chat speed

Benchmark Test Script & Code Reference

The benchmark tests were executed locally using the companion test script. The full source code is hosted in the companion repository:
👉 software-permanence/01-local-llm-vscode

Here is the raw terminal log output of running test_local_llm.py against Ollama:

=== Running Local LLM Workstation Benchmark ===
Target model: qwen2.5-coder:14b (Q4_K_M)

[Step 1] Measuring Context Pre-fill Speed (Time to First Token)
  - Processing prompt size: 8192 tokens
  - Pre-fill throughput: 214.8 tokens/second

[Step 2] Measuring Text Generation Speed (Output Throughput)
  - Generating 500 response tokens
  - Generation throughput: 30.0 tokens/second

[Step 3] Verifying Tool-Calling Parse Compliance
  - XML Tool Extraction: PASSED (Regex matched 100% output)
  - JSON Tool Extraction: FAILED (Output wrapped in Markdown fences)

=== Validation Complete: Qwen 14B behaves at cloud-parity speed ===

Step 4: Protect Workspace CPU (.continueignore)

By default, Continue tries to index every file in your workspace to build local vector embeddings for chat retrieval. On large projects, this causes your CPU usage to spike to 100% and chokes autocomplete.

To prevent this, create a .continueignore file in the root of your project directory:

.git/
node_modules/
dist/
build/
.svelte-kit/
*.log

Fixing Context Shifting Latency

Autocomplete can freeze for 2-3 seconds when you switch tabs because Continue is parsing the entire contents of the new file.

The Fix: In VS Code settings, search for Continue: Tab Autocomplete Options, and set Prefix Length to 500 and Suffix Length to 250. Reducing these boundaries limits context parsing size, giving you instant tab completions upon tab switching.

Step 5: Expand to the Command Line (Terminal Agents & Pipes)

Once your local model runner is set up, you aren't restricted to the IDE. Ollama’s desktop interface includes a native Launch registry that allows you to spin up open-source terminal agents directly from your CLI.

⚠️ Beginner Warning (The Git Sandbox Rule): Terminal-native agents (opencode, claude) execute edits and run commands directly on your local system. Before launching an agent from your CLI, always ensure you are running it inside a clean Git repository. If the agent runs a destructive command or writes broken code, you can roll back your workspace instantly via git reset --hard.

1. Launching Terminal-Native Coding Agents

Instead of paid cloud services, you can run autonomous command-line developers directly inside your shell:

OpenCode (Anomaly's open-source coding agent): An autonomous terminal coder that reads build logs, refactors files, and handles tasks locally:
```
ollama launch opencode
```
Copilot CLI (Terminal helper agent): Explains shell commands, generates commands from natural language, and handles prompt operations in your terminal:
```
ollama launch copilot-cli
```
Claude Code (Subagent coding CLI): Anthropic’s subagent developer interface configured to run locally:
```
ollama launch claude
```

2. Piping Logs for Custom Debugging

For quick troubleshooting, you can pipe compiler errors or log dumps directly into the model without copying and pasting:

# Pipe an execution error log to Ollama
cat error.log | ollama run qwen2.5-coder:14b "Explain this error and suggest a fix"

Direct Programmatic API Access

You can call your local models directly inside your applications or custom tooling. Here is how to execute a generation request using Curl and Python:

Using Curl:

curl -s -X POST http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "Convert this bash script to a Python script: $(cat build.sh)",
  "stream": false
}' | jq '.response'

Using Python:

import urllib.request
import json

payload = {
    "model": "qwen2.5-coder:14b",
    "prompt": "Convert this bash script to a Python script.",
    "stream": False
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"}
)

with urllib.request.urlopen(req) as response:
    response_data = json.loads(response.read().decode("utf-8"))
    print(response_data.get("response"))

Pro-Tips & Troubleshooting

Issue: Port 11434 is Already in Use

On macOS, Ollama runs as a background service and will block port 11434 even if the app UI is closed.

The fix: Manually kill the background process via terminal:
```
pkill Ollama
```

Issue: Zero-Lag Loading (keep_alive)

By default, Ollama unloads models from memory after 5 minutes of inactivity. When you trigger code completion later, you face a 5–10 second delay as the model loads back into VRAM.

The fix: Set the model to remain permanently loaded in GPU memory by configuring the keep_alive parameter to -1 (always stay in memory) or 30m (30 minutes) in your API settings.

🔰 Beginner's Troubleshooting Checklist

If your local development setup is failing, use this diagnostic guide to find the cause:

Is Ollama running? Open your terminal and run ollama list. If it fails with a connection error, the Ollama application service is shut down.
Is autocomplete lagging? If suggestions take more than 2-3 seconds, check if your model is spilling into system RAM. In Activity Monitor (macOS) or Task Manager (Windows), look at memory swap. If swap is active, you are running a model too large for your VRAM.
Is Continue forgetting instructions? If the sidebar chat stops responding or behaves erratically, you have hit the context limit of the loaded model. Restart the chat session to clean the active history window.

Summary

Running local models provides code privacy and offline capabilities. By combining Ollama, LM Studio, and Continue, you can configure a usable local developer environment in both your IDE and terminal.

What models are you running locally for autocomplete? Let me know in the comments.

Hi, I'm Praveen Veera. I build practical AI systems, specializing in Enterprise AI Platforms, Local LLMs, and Dev Tools.

Read my notes:

Substack Newsletter: praveenbuilds.substack.com
LinkedIn: linkedin.com/in/praveen-veera-6ab22567
GitHub (Companion Code): github.com/praveenveera/software-permanence
Dev.to: dev.to/praveen_builds
Medium: medium.com/@praveenveera92
Instagram: @praveen.builds
Hashnode: hashnode.com/@praveen-builds