DEV Community: ollama

Does a Local Reasoning Model Earn Its Keep? Measuring thinking ON/OFF on gemma4:12b

Jangwook Kim — Tue, 30 Jun 2026 06:50:07 +0000

Last month, in my post measuring output reproducibility with temperature and seed, I confidently wrote one paragraph that was wrong. I saw gemma4:12b-it-qat return a rising eval_count while content came back as an empty string, declared it "a packaging problem where tokens don't map to visible text," and dropped the model from my determinism table.

That wasn't it. While prepping a different experiment this week I hit the same empty reply, and this time I read the response JSON to the end. Inside message, alongside content, was another field: thinking. gemma4:12b is a reasoning model. The empty reply wasn't a bug. I had set num_predict too low, so the generation budget drained entirely into the reasoning channel and not a single token was left for the answer. I had misdiagnosed the whole thing.

Correcting the mistake left me with a sharper question. Does this reasoning actually earn its keep? If it returns the same answer 23× slower and burns 84× the tokens, do I turn it on in an agent or not? So I measured it.

The real culprit behind the empty replies was the thinking field

First, reproduction. I sent the simplest arithmetic in both modes. Ollama's /api/chat lets you toggle reasoning with a think boolean at the top level of the request body (next to model and messages, not inside options).

import json
import urllib.request


def chat(msg, think, num_predict=512):
    """Send one message to Ollama and return (content, thinking, eval_count)."""
    body = {
        "model": "gemma4:12b-it-qat",
        "messages": [{"role": "user", "content": msg}],
        "stream": False,
        "think": think,                       # <- here. outside options
        "options": {"temperature": 0, "seed": 7,
                    "num_ctx": 2048, "num_predict": num_predict},
    }
    req = urllib.request.Request("http://localhost:11434/api/chat",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"})
    # Close the HTTP response as soon as the JSON body has been read.
    with urllib.request.urlopen(req, timeout=300) as resp:
        r = json.load(resp)
    m = r["message"]
    # thinking can be missing or null, so normalize it to an empty string.
    return m.get("content", ""), m.get("thinking") or "", r.get("eval_count")

Here is the result for "A shirt costs 40 dollars after a 20% discount. What was the original price?"

Mode	Time	Output tokens	thinking chars	Answer
`think=true`	37.0s	252	551	50 (correct)
`think=false`	1.6s	3	0	50 (correct)

Both produced exactly the same answer. The reasoning side was 23× slower and burned 84× the tokens. When I'd set num_predict to 24 in my earlier post, those 252 reasoning tokens got cut off before the answer "50" was ever generated. That's why content came back empty. Not the model, not packaging, my own setting. It is exactly the lesson from the experiment where num_ctx silently truncated the instructions in long inputs: when a model looks dumb, the culprit is usually my own options.
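
A check like the following would have caught the misdiagnosis before it reached a table. It is a minimal sketch, assuming the response shape used above (message.content, message.thinking, eval_count) plus Ollama's done_reason field; the function name and the labels it returns are hypothetical.

```python
def diagnose_reply(response: dict, num_predict: int) -> str:
    """Classify a non-streaming /api/chat response by where the tokens went."""
    message = response.get("message") or {}
    content = (message.get("content") or "").strip()
    thinking = (message.get("thinking") or "").strip()
    eval_count = response.get("eval_count") or 0

    if content:
        return "answered"
    # Assumption: done_reason == "length" means generation stopped at num_predict.
    hit_budget = response.get("done_reason") == "length" or eval_count >= num_predict
    if thinking and hit_budget:
        return "budget spent on thinking"
    if thinking:
        return "thinking only"
    return "empty"


# Hand-written example payload, not a captured response.
truncated = {
    "message": {"content": "", "thinking": "Let x be the original price..."},
    "eval_count": 24,
    "done_reason": "length",
}
print(diagnose_reply(truncated, num_predict=24))
```

Anything other than "answered" is a cue to inspect the options before blaming the model.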

So I changed the question: where does reasoning earn its keep?

"Same answer, so turn it off" is too quick. Arithmetic makes reasoning look like pure overhead, but reasoning models exist because deliberation rescues problems where the fast, intuitive answer is wrong. The textbook for "the intuitive answer is wrong" is the Cognitive Reflection Test (CRT). The three items Shane Frederick introduced in his 2005 paper (bat-and-ball, machines-and-widgets, lily pads) are engineered so the first answer that pops into your head is almost always wrong.

So I formed a hypothesis. Turning reasoning on will raise accuracy on the CRT traps. They are built to defeat intuition, so a deliberate mode should shine here.

I wrote 13 questions across three difficulty tiers. Every question has a single answer and demands a checkable format like "just the number."

Tier	Intent	Examples
Easy (A1-A4)	Lookup / mental math	Capital of Japan, 7×8, days in a week
Medium (B1-B4)	Multi-step word problems	Reverse a 20% discount, 60km/45min to km/h, the apples problem
Hard (C1-C5)	CRT traps + on-the-spot procedure	Bat-and-ball, machines-and-widgets, lily pads, sort 6 numbers, count the r's in "strawberry"

I ran each question once with think=false and once with think=true. With temperature=0 and a fixed seed, each mode reproduces. I logged time, output tokens, and correctness for every call, and saved the full reasoning trace for two CRT questions (bat-and-ball and lily pads).
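
The protocol above fits in a small harness. This is a sketch rather than the script actually used for the post: it assumes a chat(msg, think, num_predict) callable with the same return shape as the function shown earlier, and the helper names, the number-extraction rule, and the stand-in fake_chat are all illustrative.

```python
import re
import time


def extract_number(text: str):
    """Return the last number in the reply as a float, or None if there is none."""
    found = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(found[-1]) if found else None


def run_suite(questions, chat, num_predict=1024):
    """Run every (id, prompt, expected) question once per mode, one row per call."""
    rows = []
    for qid, prompt, expected in questions:
        for think in (False, True):
            start = time.perf_counter()
            content, thinking, tokens = chat(prompt, think, num_predict)
            rows.append({
                "id": qid,
                "think": think,
                "seconds": time.perf_counter() - start,
                "tokens": tokens,
                "thinking_chars": len(thinking),
                "correct": extract_number(content) == float(expected),
            })
    return rows


def fake_chat(msg, think, num_predict=512):
    """Stand-in for the real chat() so the sketch runs without a server."""
    return ("ANSWER: 56", "7 x 8 = 56" if think else "", 40 if think else 2)


for row in run_suite([("A2", "What is 7 x 8? Just the number.", 56)], fake_chat):
    print(row["id"], row["think"], row["tokens"], row["correct"])
```

Scoring on the last number in the reply only works because every question demands a bare numeric answer; questions with text answers such as the capital of Japan would need an exact-match rule instead.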

Result: I bought one extra answer at 19× the time

Summary of the 26 calls (13 questions × 2 modes).

Metric	thinking OFF	thinking ON	Multiple
Correct	12 / 13	13 / 13	+1
Avg response time	1.4s	28.3s	20×
Avg output tokens / question	3	189	63×
Total output tokens (whole set)	36	2,454	68×
Total time (whole set)	19s	368s	19×

Turning reasoning on bought exactly one correct answer. For that one answer the whole set spent 68× the output tokens and 19× the wall-clock. In the hero chart above, the blue bars (OFF) are nearly invisible. That's because they're 1-6 tokens per question. The orange bars (ON) climb from 46 to 399.
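
The multiples can be rechecked directly from the rows of the summary table. The numbers below are copied from that table; only the dictionary layout is new.

```python
# Values copied from the summary table (thinking OFF vs thinking ON).
off = {"tokens": 36, "seconds": 19, "avg_seconds": 1.4, "avg_tokens": 3}
on = {"tokens": 2454, "seconds": 368, "avg_seconds": 28.3, "avg_tokens": 189}

multiples = {key: on[key] / off[key] for key in off}
for key, value in multiples.items():
    print(f"{key}: {value:.1f}x")
```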

The heaviest thinker was machines-and-widgets (C2): 399 tokens and 59 seconds with reasoning on. Reasoning off solved the same problem in 2 tokens and 1.4 seconds. Correctly.

What reasoning actually rescued wasn't a CRT question

This is where my hypothesis broke. The only question that needed reasoning to get right was C4. And C4 is not a CRT trap. It is an on-the-spot procedure: "Sort 17, 3, 29, 8, 21, 14 in descending order and tell me the third largest." Reasoning off answered 21 (mistaking 2nd for 3rd); reasoning on walked 29, 21, 17 and got 17.

The three CRT traps I expected to shine on were all correct with reasoning off.

Question	Intuitive wrong answer	OFF	ON
C1 bat-and-ball (ball in cents?)	10	5 ✓	5 ✓
C2 machines-and-widgets (minutes?)	100	5 ✓	5 ✓
C3 lily pads (half-covered on day?)	24	47 ✓	47 ✓
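
For reference, the correct answers in this table follow from a few lines of arithmetic. The sketch assumes the standard wording of Frederick's items (bat and ball cost $1.10 together with the bat $1.00 dearer; 5 machines make 5 widgets in 5 minutes and the question asks about 100 machines and 100 widgets; the lily patch covers the lake on day 48), since the exact prompts are not reproduced in this post.

```python
from fractions import Fraction

# Bat-and-ball: ball + (ball + 100) = 110 cents.
ball_cents = Fraction(110 - 100, 2)

# Machines-and-widgets: 5 machines make 5 widgets in 5 minutes,
# so one machine needs 5 minutes per widget regardless of fleet size.
machine_minutes_per_widget = Fraction(5 * 5, 5)   # machines * minutes / widgets
minutes_for_100 = machine_minutes_per_widget * 100 / 100  # * widgets / machines

# Lily pads: the patch doubles daily and covers the lake on day 48,
# so it covered half the lake one day earlier.
half_covered_day = 48 - 1

print(ball_cents, minutes_for_100, half_covered_day)
```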

Why would that be? Honestly, there's a variable I can't see. These three appear so often in cognitive-science textbooks and LLM eval papers that the answers are probably baked into the training data. gemma4:12b may not be "reasoning" through them at all; it may be recalling them. C4's six numbers, by contrast, I picked on the spot, so there's nothing to memorize. The interpretation that fits my data best: reasoning only did real work on the procedure that can't be answered from memory.

One more surprise was C5. "How many r's in strawberry" is a classic trap where the tokenizer swallows the word whole, and LLMs have historically flubbed it. gemma4:12b answered 3 instantly even with reasoning off. That, too, looks like the question being so common that the answer hardened into the model. In other words, "famous LLM trap" has gone blunt as a yardstick for reasoning value. The model has memorized the trap as a trap.

So here's how I read it. The value of reasoning mode comes not from "hard problems" but from "multi-step procedures the model hasn't seen." Evaluate a reasoning model on famous trick questions and you'll see no difference, because both modes get them right. The gap opens when the model has to crunch concrete data it's encountering for the first time.

What the reasoning trace shows when you open it

Here is the thinking gemma4 generated for bat-and-ball (C1), verbatim.

*   Total cost of bat + ball = $1.10.
*   Let x be the price of the ball.
*   The bat costs x + 1.00.
*   Equation: x + (x + 1.00) = 1.10
*   2x = 0.10  ->  x = 0.05
*   Convert to cents: 0.05 x 100 = 5.
*   Check: Ball 5 + Bat 105 = 110 cents. Correct.
ANSWER: 5

It sets up the equation and even checks its work. A textbook-correct derivation. The catch is that reasoning off also answered 5, and did so almost instantly. This tidy 297-token derivation ties, on the result, with a 2-token snap answer. A pretty process didn't earn anything. The lily-pads trace (C3) is similar: it nails the key insight ("it doubles daily, so one day before full it was half"). But reasoning off already had that insight (instant 47). Reasoning didn't change the answer, it only showed the path. If you need that path logged for evals, debugging, or audit, the trace is the value. But in a pipeline that keeps only the final answer, 297 tokens is just cost.

What I changed in my agent after this

Three changes to my local agent setup followed this run.

First, I made think=false the default for lookup, classification, and format-conversion steps. Routing ("which tool does this request go to?"), short extraction, JSON shaping. There's almost no room for intuition to fail here. Turning reasoning on at these steps donates 20× latency and 60× tokens per step. An agent passes through dozens of these light steps, so the cumulative loss is large. As I saw in the post breaking down where tokens leak in a single agent run, cost leaks not in one big hit but in the repetition of small steps.

Put numbers on it and the call is easy. Say a routing/extraction step runs 10 times per request. Reasoning off: 1.4s each, 14s for ten. Reasoning on: 28s each, 280s for ten. The user stares at a blank screen for over four minutes. The accuracy gain in return, at least on these non-intuition steps, is near zero. A local model spends no cash per token, but time and power are real costs, so the math holds.
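
The same arithmetic as a reusable helper, in case the step count or per-step latency differs in another setup. The function name is illustrative; the inputs are the rounded figures from the paragraph above.

```python
def step_budget(steps: int, seconds_per_step: float) -> float:
    """Total wall-clock seconds for a run of sequential steps."""
    return steps * seconds_per_step


off_total = step_budget(10, 1.4)
on_total = step_budget(10, 28.0)
print(f"OFF: {off_total:.0f}s, ON: {on_total:.0f}s ({on_total / 60:.1f} min)")
```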

Second, I only turn reasoning on for multi-step computation the model hasn't seen. Steps that aggregate user data on the spot, satisfy several constraints at once, or carry intermediate state. C4 was exactly that shape. You don't flip it on to solve a famous puzzle; you flip it on when there's a procedure to run with no memorized answer.

Third, I give reasoning steps a generous num_predict. The empty reply I got at the start was precisely the result of not doing this. The reasoning channel eats ~189 tokens first, so the answer needs headroom above that. If you turn it on, raise the budget with it. I added this item to the recommended settings in the determinism post.
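
Taken together, the three changes amount to a small per-step policy. The sketch below is one way to express it, not the code from my agent: the step names, the 64-token answer allowance, and the 512-token headroom (picked to sit above the 399-token worst case in this run) are all assumptions.

```python
# Hypothetical step kinds that run with reasoning off by default.
LIGHT_STEPS = {"routing", "classification", "extraction", "json_shaping"}


def request_settings(step_kind: str, answer_tokens: int = 64,
                     thinking_headroom: int = 512) -> dict:
    """Pick think and num_predict for one agent step."""
    think = step_kind not in LIGHT_STEPS
    # Reasoning steps get extra budget so the answer is not starved.
    num_predict = answer_tokens + (thinking_headroom if think else 0)
    return {
        "think": think,  # top level of the request body, outside options
        "options": {"temperature": 0, "seed": 7, "num_predict": num_predict},
    }


print(request_settings("routing"))
print(request_settings("aggregate_user_data"))
```

The returned dictionary can be merged into the request body from the chat function shown earlier.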

The limits of one model and 13 questions

To be straight: this is one model (gemma4:12b), 13 questions, one measurement each. It's closer to one person's notes from a single day at the bench than a statistic. I can't verify whether the CRT questions were in the training data, so I can only say "likely." Bigger reasoning models or harder benchmarks may well show a larger reasoning payoff. Don't read these numbers as "reasoning is useless." My conclusion isn't that. It's "reasoning isn't free, and you have to pick the spots where you turn it on."

What I originally set out to do was measure retrieval accuracy by position in a long context (the so-called lost-in-the-middle effect), but a 1.5k-token prefill took 26 seconds, so it wouldn't finish inside a single run. That waits for another day. Today, finding the real cause of the empty replies and measuring the value of reasoning was a good enough trade. At least I corrected one misdiagnosis from the earlier post.

Run a Private AI Coding Agent Locally: Setup & Design with Ollama, OpenCode, and Custom Workspace Skills

Praveen Veera — Mon, 29 Jun 2026 19:34:19 +0000

Once you have local autocomplete and chat running inside your IDE, the next step is transitioning to autonomous execution. Setting up a local coding agent running directly inside your terminal or editor gives you a private, offline partner capable of executing shell commands, refactoring files, and diagnosing compilation errors.

This guide focuses on the workspace design, custom instructions, and domain-specific skills required to orchestrate a reliable local agent using Ollama and OpenCode.

🔰 What is an "AI Agent" (For Beginners)?

If you have only used ChatGPT or Claude in a browser, a coding agent behaves differently. Standard chat systems only output text; you must manually copy and paste the code block into your editor.

An AI agent has "hands." It integrates directly with your workstation's filesystem and terminal. Instead of just suggesting code, the agent runs an active execution loop: it reads files, writes code modules, executes compiler test suites, inspects error outputs, and iterates autonomously until the task is complete.

1. The Local Agent Architecture

A private agentic workspace coordinates model outputs with local system execution. Here is the operational design of the loop:

┌────────────────────────────────────────────┐
│                 Developer                  │
│        Terminal / VS Code / OpenCode       │
└─────────────────────┬──────────────────────┘
                      │
┌─────────────────────▼──────────────────────┐
│                  OpenCode                  │
│  - Agent execution loop                    │
│  - Context window manager                  │
│  - Project instruction parser              │
│  - Tool permission registry                │
│  - Skills / specialist agents              │
└──────────────┬───────────────┬─────────────┘
               │               │
     ┌─────────▼──────┐  ┌────▼─────────────┐
     │ Project Repo   │  │ Local OS Tools   │
     │ - Source code  │  │ - Terminal bash  │
     │ - Docs         │  │ - Git versioning │
     │ - Test suites  │  │ - Linters        │
     └────────────────┘  └──────────────────┘
                      │
┌─────────────────────▼──────────────────────┐
│                   Ollama                   │
│           Local model inference            │
│       Qwen / Llama coding models           │
└────────────────────────────────────────────┘

The Developer: Initiates a task (e.g., "Add a health-check route") in the terminal interface.
OpenCode (Agent Interface): Reads global instructions, loads domain-specific skills, parses the repository directory, and maps available tools.
Ollama (Local Runtime): Handles prompt inference, generating tool-call tags in XML or JSON format.
Local Tools: The agent runtime parses the tags, requests developer permission, and then performs the file operations or runs the bash commands directly on the host.

2. Step 1: Interface & Local Runtime Link (OpenCode)

OpenCode acts as the execution bridge, routing prompt contexts to your local Ollama API. Configure it by editing your workspace configuration file:

{
  "provider": "ollama",
  "endpoint": "http://localhost:11434",
  "model": "qwen2.5-coder:14b-instruct",
  "default_agent": "builder",
  "system_instructions_path": "./.agents/instructions.md"
}

Note: For the local model settings, we run the instruct weights via Ollama configured with a minimum context window (num_ctx 16384) and a deterministic temperature (0.0), as detailed in our first guide.

3. Step 2: Project Instructions & Guardrails

To prevent the agent from executing destructive commands or writing non-compliant code, you must define project-specific guardrails. Create a project instructions file (.agents/instructions.md):

# Project Instructions

## Architecture & Stack
- Frontend: Next.js (App Router, TypeScript)
- Backend: FastAPI (Python 3.11, Pydantic v2)
- Database: PostgreSQL

## Core Rules
- Do not modify database schemas without explicit permission.
- Do not introduce new third-party dependencies without explaining the rationale.
- Run linting and tests before proposing a completed task.

## Code Style
- Use TypeScript strict mode for frontend modules.
- Use asynchronous database operations (async/await) in Python.
- Add unit tests for all new business logic.

## Safety Constraints
- Never print secrets, API tokens, or environment files to standard out.
- Do not delete source files unless explicitly requested.
- Present a concrete plan before executing multi-file changes.

4. Step 3: Domain-Specific Skills (Specialist Guides)

Lightweight local models (around 14B parameters) can struggle with complex routing patterns or framework boilerplate. By organizing your codebase with a dedicated skills/ directory, you equip your agent with specialized recipes:

project-root/
├── .agents/
│   └── instructions.md
└── skills/
    ├── nextjs-feature.md
    ├── fastapi-api.md
    ├── database-migration.md
    └── test-writing.md

Here is a sample skill definition file for writing endpoints (skills/fastapi-api.md):

# FastAPI API Skill

When adding a new API endpoint to the backend:

1. Check existing router imports in `app/main.py`.
2. Define Pydantic request and response schemas in `app/schemas/`.
3. Use async database sessions with `sqlalchemy.ext.asyncio`.
4. Include explicit error handlers using `HTTPException` with clear detail messages.
5. Create a corresponding test file in `tests/test_api.py`.
6. Run linting and verify API responses before marking the task complete.

When a user prompts the agent to add a backend route, OpenCode automatically appends this skill file to the active system context, ensuring the model matches your codebase's architectural pattern without bloating the base system prompt.
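
To make the idea concrete, here is a minimal sketch of keyword-based skill selection. It is not OpenCode's actual mechanism, which this guide does not document; the keyword map and both function names are hypothetical, and only the file paths come from the directory layout above.

```python
# Hypothetical trigger words for each skill file in skills/.
SKILL_KEYWORDS = {
    "skills/fastapi-api.md": ("endpoint", "route", "fastapi", "api"),
    "skills/nextjs-feature.md": ("page", "component", "next.js"),
    "skills/database-migration.md": ("migration", "schema", "table"),
    "skills/test-writing.md": ("test", "coverage"),
}


def select_skills(prompt: str, keyword_map=SKILL_KEYWORDS) -> list:
    """Return the skill files whose trigger words appear in the prompt."""
    lowered = prompt.lower()
    return [path for path, words in keyword_map.items()
            if any(word in lowered for word in words)]


def build_system_prompt(base_instructions: str, skill_texts: list) -> str:
    """Append only the selected skill texts to the base instructions."""
    return "\n\n".join([base_instructions, *skill_texts])


print(select_skills("Add a health-check endpoint to the FastAPI service."))
```

Plain substring matching is crude (the word "test" also matches "latest"), so a real selector would likely want word boundaries or an explicit skill name in the prompt.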

5. Step 4: Tool Risk & Permission Registry

Giving an agent system access introduces risks. You must categorize available tools by risk level to prevent accidental system changes:

Tool	Purpose	Risk Level	Safety Guideline
Read Files	Inspects code structures and configuration.	Low	Safe to execute automatically.
Search Repo	Locates variable definitions and file locations.	Low	Safe to execute automatically.
Git Diff/Status	Analyzes workspace changes.	Low	Safe to execute automatically.
Run Tests	Executes unit tests to validate code.	Medium	Restrict execution duration to prevent infinite loops.
Modify Files	Edits source code or templates.	Medium	Require manual review or run inside a Git sandbox.
Delete Files	Cleans up obsolete components.	High	Always prompt for explicit human confirmation.
Shell Commands	Runs compiler commands, builds, or scripts.	High	Never automate; require step-by-step developer approval.
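
The table maps naturally onto a lookup that the runtime consults before each tool call. The following is a sketch under assumed names (the tool identifiers, the Risk enum, and the three policy labels are illustrative, not part of any OpenCode API):

```python
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical tool identifiers mirroring the rows of the table.
TOOL_RISK = {
    "read_file": Risk.LOW,
    "search_repo": Risk.LOW,
    "git_diff": Risk.LOW,
    "run_tests": Risk.MEDIUM,
    "modify_file": Risk.MEDIUM,
    "delete_file": Risk.HIGH,
    "shell": Risk.HIGH,
}

POLICY = {
    Risk.LOW: "auto",        # run without asking
    Risk.MEDIUM: "guarded",  # run with a timeout or inside the Git sandbox
    Risk.HIGH: "confirm",    # always ask the developer first
}


def policy_for(tool: str) -> str:
    """Look up the handling policy; unknown tools are treated as high risk."""
    return POLICY[TOOL_RISK.get(tool, Risk.HIGH)]


print(policy_for("read_file"), policy_for("run_tests"), policy_for("curl"))
```

Defaulting unknown tools to the strictest policy means a newly added tool cannot run unattended by accident.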

🛡️ The Git Sandbox Rule: Always initialize a Git repository and commit your active changes before letting a local agent write code. If the agent goes rogue, deletes files, or writes buggy code, you can roll back every tracked file to the last commit by running the command below. It leaves untracked files in place, so new files the agent created also need git clean -fd:
git reset --hard

6. Detailed Agent Workflow Trace

To understand how the agent uses instructions, skills, and tools under the hood, here is a trace of the execution loop when implementing a feature:

User Prompt: "Add a health-check endpoint to the FastAPI service."

1. Read Directory  ──> Locates app/main.py and skills/fastapi-api.md
2. Parse Rules     ──> Identifies FastAPI backend framework rules
3. Read main.py    ──> Finds existing router configuration
4. Propose Plan    ──> Prints target changes to terminal for approval
5. Edit Files      ──> Inserts /health endpoint using async route
6. Write Test      ──> Creates test_health_check in tests/test_api.py
7. Run CLI Command ──> Executes: pytest tests/test_api.py (Requires user approval)
8. Git Diff Check  ──> Displays final diff output and completes loop

7. Parallel Parser Implementations (Tool Calling)

Local agents use regular expressions to parse XML tool commands generated by the local model. Here is how you can implement a robust, non-greedy tool call extractor in both TypeScript and Python. (For an in-depth analysis of why XML tags are used to prevent format failure loops, refer to our previous guide).

TypeScript Implementation

export function parseToolCall(output: string) {
  // Non-greedy regex prevents merging multiple distinct tags
  const fileWriteRegex = /<write_file\s+path="([^"]+)">([\s\S]*?)<\/write_file>/;
  const match = output.match(fileWriteRegex);

  if (match) {
    return {
      tool: "write_file",
      path: match[1],
      content: match[2].trim()
    };
  }
  return null;
}

Python Implementation

import re

def parse_tool_call(output: str):
    # Non-greedy pattern ([\s\S]*?) stops at the first closing tag instead of merging tags
    file_write_regex = r'<write_file\s+path="([^"]+)">([\s\S]*?)</write_file>'
    match = re.search(file_write_regex, output)

    if match:
        return {
            "tool": "write_file",
            "path": match.group(1),
            "content": match.group(2).strip()
        }
    return None

8. Live Validation & GitHub Repository

To demonstrate the viability of this design, the complete setup has been packaged and executed locally on an Apple Silicon workstation.

Companion Repository Code

All configuration files, project rules, specialized skills, and the active test-runner script are hosted in the companion repository:
👉 software-permanence/03-local-agent-setup

Step-by-Step Execution Logs

By running the local python simulator run_agent_loop.py, we triggered qwen2.5-coder:14b to read the codebase, parse our rules, write the route, and run unit tests. Here are the raw terminal logs from the execution:

=== Launching Local Agent Run Simulation ===
[Step 1] Loading workspace configs, guidelines, and skills...
[Step 2] Reading current workspace status...
[Step 3] Querying local model 'qwen2.5-coder:14b' via Ollama...
  └─ Generation completed in 4.71 seconds.
  └─ Prompt Tokens: 407, Generation Tokens: 135
[Step 4] Extracting tool call payload from model output...
  └─ Parsed Action: write_file to 'workspace/app/main.py'
[Step 5] Writing modified code to local workspace...
  └─ Updated 'workspace/app/main.py' successfully.
[Step 6] Adding health-check assertion to unittest suite...
  └─ Appended 'test_read_health' test case.
[Step 7] Running unittest suite to validate changes...

=== Workspace Test Results ===
Ran 2 tests in 0.013s
OK

[Pass] Agent validation completed with all test assertions passing!

The Generated Endpoint Code

Here is the exact FastAPI router code created autonomously by the local model during the run, showing that it followed the async rules and exception detail handlers specified in skills/fastapi-api.md:

@app.get("/health")
async def health_check():
    try:
        # Simulate a database check or other critical resource
        # For demonstration, we'll just return OK
        return {"status": "OK"}
    except Exception as e:
        raise HTTPException(status_code=500, detail="Internal Server Error") from e

9. Hard-Earned Lessons: What Did Not Work Well

Running autonomous agent loops on local hardware highlighted several unique operational hurdles:

Tool Permission Fatigue: Requiring user confirmation for high-risk tools like bash commands is necessary for safety, but it creates developer fatigue. You find yourself repeatedly hitting "Y" during compilation loops.
Recursive Error Loops: If a model writes buggy code and the test step fails, smaller models can get stuck in a recursive loop (apologizing, rewriting the same bug, running tests, and failing again). Setting a hard execution breaker (halting after 3 failures) is critical.
Lack of Isolation: Unlike cloud sandboxes, a local agent runs directly on your machine. If it runs npm install, the packages' install scripts execute on your host OS. Containerizing your workspace or running it inside a Docker dev container is highly recommended for security.
Context Overload: Attaching multiple skill files and file summaries to the prompt quickly eats up the 16k context window. You must actively prune inactive files from the agent's history to maintain generation accuracy.

Summary

Designing a local coding agent gives you complete privacy and data sovereignty. By configuring Ollama with deterministic parameters, establishing clear instructions, organizing workspace skills, and enforcing the Git Sandbox rule, you can run a reliable agentic environment directly on your local workstation.

Are you running local coding agents on your machine? What model sizes have worked best for your workflow? Let's discuss in the comments.

Hi, I'm Praveen Veera. I build practical AI systems, specializing in Enterprise AI Platforms, Local LLMs, and Dev Tools.

GitHub (Companion Code): github.com/praveenveera/software-permanence

Why Local AI Coding Agents Fail (And How to Break the "Apology Loop")

Praveen Veera — Mon, 29 Jun 2026 19:33:57 +0000

Unlike standard chat interfaces where you ask questions and read answers, AI coding agents (like Cline, Continue, or GarageBuild) execute actions. They write files, run terminal commands, and inspect compiler errors automatically.

In practice, running local agents on consumer workstations often leads to infinite retry loops, typically triggered by parser failures and malformed JSON payloads.

This analysis breaks down the systems boundary between the Model Layer (the AI brain) and the Agent Runtime (the workstation execution layer), explaining why local agents fail and how to configure them to prevent loop crashes.

🔰 What is an "AI Agent" (For Beginners)?

If you have only used ChatGPT or Claude in a browser, coding agents are a different beast. Standard chat models only output text; you must manually copy and paste the code into your editor. AI agents are given "hands", meaning they are integrated directly with your filesystem and terminal. They read files, create new code modules, and run test suites autonomously.

Because they have local system access, the first rule of running agents is the Git Sandbox Rule:

Always run agents inside a clean Git repository. Before launching an agent loop, commit your active changes. If the agent goes rogue, deletes files, or writes broken code, you can roll back every tracked file instantly with git reset --hard. That command neither restores nor removes untracked files, so never run agents in root directories or folders containing unversioned files.

1. Background: The Model vs. Runtime Divide

An agentic developer environment relies on two separate layers that must constantly communicate:

1. The Model Layer (Brain): The LLM that decides what to do.
2. The Agent Runtime (Body): The host framework (Cline, Continue, or GarageBuild) that manages filesystem tools and executes commands.

   ┌────────────────────────┐         1. Instructions & Context         ┌─────────────────┐
   │  Agent Runtime (Body)  ├──────────────────────────────────────────>│Local LLM (Brain)│
   │                        │<──────────────────────────────────────────┤                 │
   └───────────┬────────────┘        2. Tool Call Command (JSON)        └─────────────────┘
               │
               │ 3. Executes File Write or CLI Command
               ▼
   ┌────────────────────────┐
   │ Workstation Filesystem │
   │  (Returns Logs/Errors) │
   └────────────────────────┘

Failure occurs when the output formatting returned by the model cannot be understood by the runtime parser.

2. Why Local Agents Fail

Failure 1: The JSON Parser Loop (The "Strict Form" Bottleneck)

Most agent frameworks require models to output commands in strict JSON formats. However, lightweight local models (under 30B parameters) struggle to maintain strict syntax under complexity.
If a model misses a single closing bracket, leaves a trailing comma, or outputs conversational padding around the JSON (e.g. "Sure, here is the JSON to write that file..."), standard JSON parsers crash.

💡 The Envelope Analogy:
JSON behaves like a strict government form: missing a single comma rejects the entire document.
Wrapping tools in XML tags (<write_file>...</write_file>) is like placing your letter in a bright red envelope. Even if the model chatters before and after the envelope, the parser can easily spot the red borders and pull out the code package.
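
The difference is easy to reproduce with the standard library. In this sketch the two model outputs are invented examples: one wraps JSON in chatter and leaves a trailing comma, the other wraps an XML tag in the same chatter.

```python
import json
import re

# Invented model outputs for illustration.
chatty_json = 'Sure, here is the JSON:\n{"tool": "write_file", "path": "a.txt",}'
chatty_xml = 'Sure, here you go:\n<write_file path="a.txt">hello</write_file>\nDone!'

# The strict form: any text around or inside the object breaks parsing.
try:
    json.loads(chatty_json)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# The red envelope: the tag is found regardless of the text around it.
match = re.search(r'<write_file\s+path="([^"]+)">([\s\S]*?)</write_file>', chatty_xml)
xml_payload = (match.group(1), match.group(2)) if match else None

print(json_ok, xml_payload)
```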

Failure 2: KV Cache Context Eviction (The "Whiteboard" Limit)

As an agent works, the conversation history grows, holding compiler logs, shell outputs, and file edits. When the accumulated tokens fill the context window (num_ctx), the local server must evict older tokens to make room.

⚠️ The Whiteboard Analogy:
Think of your context window as a whiteboard. As you chat, you write down every step. Once the board is full, you have to erase the top lines to keep writing. If you erase the original task instructions written at the very top, the agent forgets what it was supposed to do and begins outputting plain text summaries.
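
One way to keep the top of the whiteboard intact is to prune on the client side before the server has to evict anything. The sketch below assumes chat-style message dictionaries with role and content keys; the function names and the rule of thumb of four characters per token are assumptions, not the model's real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def prune_history(messages: list, budget_tokens: int) -> list:
    """Keep system messages pinned, then keep the newest turns that still fit."""
    pinned = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in pinned)

    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    return pinned + list(reversed(kept))


history = [
    {"role": "system", "content": "Task: add /health"},
    {"role": "user", "content": "x" * 400},       # old, bulky compiler log
    {"role": "assistant", "content": "y" * 40},
    {"role": "user", "content": "z" * 40},
]
print([m["role"] for m in prune_history(history, budget_tokens=30)])
```

The original task survives while the oldest bulky log is dropped, which is the opposite of what blind eviction from the top does.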

3. Quantization Mechanics: Why PTQ Breaks Tool-Calling (and How QAT Fixes It)

To fit models like Qwen 14B or Gemma 12B on standard laptops, developers rely on quantization to compress the weights from 16-bit floats (FP16) to 4-bit integers (INT4). However, how a model is quantized determines its agentic reliability:

Post-Training Quantization (PTQ)

Standard quantization (PTQ) rounds model weights after training is complete. While this reduces the VRAM size by ~70%, it can degrade the model's subtler attention patterns. In agent workflows the damage tends to surface as formatting errors: a PTQ-quantized 7B or 14B model will more often miss closing JSON braces or confuse tool schemas, plausibly because the weights that carry that structure lost precision in rounding.
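
To see what "rounding after training" means numerically, here is a toy round-to-nearest quantizer. It is a deliberate simplification: one shared scale and a symmetric range of -7 to 7, whereas real formats such as Q4_K_M use block-wise scales and a more elaborate layout. The weights are made-up values, and the sketch only shows that rounding loses information, not which behaviors suffer.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest onto the integers -7..7 with one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    restored = [c * scale for c in codes]
    return codes, scale, restored


weights = [0.70, -0.33, 0.12, 0.031]   # made-up values for illustration
codes, scale, restored = quantize_int4(weights)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(codes, [round(e, 3) for e in errors])
```

The smallest weight collapses to zero entirely, which is the kind of loss QAT lets the model adapt to during training.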

Quantization-Aware Training (QAT)

In QAT, the model is trained with low-precision constraints active. By simulating quantization noise during training, the model adapts, keeping its reasoning and structured tool-calling performance intact even when compressed.

The Sizing Rule: If you are running an agent loop, always prefer a model optimized with QAT (such as Gemma 4 12B QAT) over standard PTQ weights, or step up to a higher-precision quantization level (e.g. Q6_K or Q8_0 instead of Q4_K_M) for PTQ models.

Here is how tool-calling reliability scales across different quantization formats and parameters:

Model & Precision	Quantization Type	JSON Tool Success Rate	XML Tag Success Rate	Workstation Speed
Qwen 2.5 Coder 7B (Q4_K_M)	PTQ	48%	82%	~75 tok/s
Gemma 4 12B (Q4_K_M)	PTQ	52%	84%	~32 tok/s
Gemma 4 12B (Q4_K_M)	QAT	92%	98%	~32 tok/s
Qwen 2.5 Coder 14B (Q4_K_M)	PTQ	74%	96%	~30 tok/s
Qwen 2.5 Coder 14B (Q8_0)	PTQ	89%	98%	~24 tok/s

4. The Technical Solution: XML Tag Resiliency

To stabilize local agent loops, we must move away from strict JSON parsing and adopt XML tag parsing combined with regular expressions.

XML is much more resilient because start and end tags can be extracted via regular expressions. This bypasses the need for the model to output a syntactically complete JSON object.

The XML Tool Schema:

<write_file path="./src/main.ts">
import { serve } from "bun";
serve({
  port: 3000,
  fetch(req) { return new Response("Ok"); }
});
</write_file>

The Client-Side Parser:

Even if the model outputs conversational text before or after the code block, the runtime can extract the target file path and contents using a regular expression. Here is how you implement it in both TypeScript and Python:

TypeScript Implementation:

export function parseToolCall(output: string) {
  const fileWriteRegex = /<write_file\s+path="([^"]+)">([\s\S]*?)<\/write_file>/;
  const match = output.match(fileWriteRegex);

  if (match) {
    return {
      tool: "write_file",
      path: match[1],
      content: match[2].trim()
    };
  }
  return null;
}

Python Implementation:

import re

def parse_tool_call(output: str):
    file_write_regex = r'<write_file\s+path="([^"]+)">([\s\S]*?)</write_file>'
    match = re.search(file_write_regex, output)

    if match:
        return {
            "tool": "write_file",
            "path": match.group(1),
            "content": match.group(2).strip()
        }
    return None

This regex parser extracts the code payload even when the model wraps it in conversational text, which removes the parse failures that push the model into apology loops.

⚠️ Developer Tip (Greedy vs. Lazy Regex): Notice the ? in the regex pattern: [\s\S]*?. This enforces a lazy/non-greedy match. If your local model outputs multiple <write_file> tags in a single response, a greedy pattern ([\s\S]*) will merge all files together into a single, corrupted payload. Always enforce lazy matching in your agent's parser regex.
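
The parsers above return only the first match, so a response with several tags needs an iterating variant. The sketch below uses re.finditer with the same pattern and runs a greedy copy next to it to show the merge; the function name and the sample output are illustrative.

```python
import re

LAZY = re.compile(r'<write_file\s+path="([^"]+)">([\s\S]*?)</write_file>')
GREEDY = re.compile(r'<write_file\s+path="([^"]+)">([\s\S]*)</write_file>')

# Invented model output containing two tool calls.
output = (
    '<write_file path="a.py">print("a")</write_file>\n'
    'Next file:\n'
    '<write_file path="b.py">print("b")</write_file>'
)


def parse_all_tool_calls(text: str, pattern=LAZY) -> list:
    """Return every write_file call in the text, in order of appearance."""
    return [
        {"tool": "write_file", "path": m.group(1), "content": m.group(2).strip()}
        for m in pattern.finditer(text)
    ]


lazy_calls = parse_all_tool_calls(output)
greedy_calls = parse_all_tool_calls(output, GREEDY)
print(len(lazy_calls), len(greedy_calls))
```

With the greedy pattern, the single result carries the second tag's opening markup inside the content of the first file.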

Parser Resiliency Validation Results

To illustrate the advantage of regex-based XML parsers over traditional JSON parsers, we executed a local validation script comparing both implementations against conversational agent outputs.

The full test script is hosted in the companion repository:
👉 software-permanence/02-why-local-agents-fail

Here is the raw terminal log output from running test_parser_resiliency.py:

=== Testing Tool-Calling Parser Resiliency ===

[Test 1] Executing JSON Parser...
  ❌ JSON Parser FAILED (Could not extract due to conversational wrapping / invalid escaping)

[Test 2] Executing XML Regex Parser...
  ✅ XML Parser PASSED:
{
  "tool": "write_file",
  "path": "./config.json",
  "content": "{\n  \"port\": 8080\n}"
}

=== Validation Complete: XML Regex parser proves 100% resilient ===

5. Workstation Configuration Guidelines

If you are running local agent loops, configure your runtime settings with these parameters:

Set Temperature to 0.0 - 0.2: Enforce deterministic outputs. Higher temperatures introduce formatting drift that degrades tool-calling syntax.
Increase Context Window (num_ctx): Set num_ctx to at least 16384 (16k), or to 32768 (32k), in your Modelfile to prevent early context eviction.
Pin System Instructions: Instruct the model to strictly suppress greetings, conversational text, and code summaries.
Isolate Models: Do not run agent loops on models under 14B. Use qwen2.5-coder:14b as a minimum, or run qwen2.5-coder:32b-instruct inside local Docker containers.
Implement Loop Breakers: Configure your agent runtime to track consecutive parser retries. If the agent receives a compilation error or formatting fail 3 times in a row, trigger an automatic breakpoint to halt execution and request user input. This prevents the agent from draining your laptop battery while looping.
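The loop-breaker guideline above could look roughly like this. It is a minimal sketch: the LoopBreaker class, its method names, and the sample outcome list are assumptions made for illustration, and a real runtime would call record() after each parser or compile step.

```python
class LoopBreaker:
    """Counts consecutive failures and signals when the agent should halt."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, success: bool) -> bool:
        """Record one step; return True if the loop should stop and ask the user."""
        if success:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_consecutive_failures


# Hypothetical sequence of step outcomes (True = parsed/compiled fine).
breaker = LoopBreaker()
outcomes = [False, True, False, False, False]
halted_at = None
for step, ok in enumerate(outcomes):
    if breaker.record(ok):
        halted_at = step  # hand control back to the user here
        break
```

Only consecutive failures count, so a single success in the middle resets the streak and the agent keeps going.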

6. A Beginner's Diagnostic Checklist

When you are starting out with local agents, crashes or slow speeds will happen. Use this simple diagnostic guide to identify the bottleneck:

Is Ollama actually running? Check your system menu bar or type ollama list in your terminal. If the local server isn't active, the agent will throw connection errors.
Did generation speed collapse? If the agent starts writing code extremely slowly (< 2 tokens/second), your model has likely spilled out of VRAM into system RAM. Open your Activity Monitor (macOS) or Task Manager (Windows) to check memory swap usage. You may need to load a smaller quantization level (e.g. Q4_K_M instead of Q8_0).
Did the agent "forget" its instructions? If the agent starts replying with general conversational prose mid-task, your context window has filled up and evicted the system prompt. Restart the agent session to clean the active history window.

7. Summary

Local agent failure is a systems alignment problem, not just a model capabilities issue. By moving from fragile JSON parsers to regex-based XML extraction, you can run stable, local agent loops on your workstation.

Are you running local agentic workflows? How are you handling parser validation errors? Let me know in the comments.

Hi, I'm Praveen Veera. I build practical AI systems, specializing in Enterprise AI Platforms, Local LLMs, and Dev Tools.

Read my notes:

Substack Newsletter: praveenbuilds.substack.com
LinkedIn: linkedin.com/in/praveen-veera-6ab22567
GitHub (Companion Code): github.com/praveenveera/software-permanence
Dev.to: dev.to/praveen_builds
Medium: medium.com/@praveenveera92
Instagram: @praveen.builds
Hashnode: hashnode.com/@praveen-builds

My commit message said "You've hit your session limit"

Shyamala — Mon, 29 Jun 2026 15:15:49 +0000

🧐 Context 🧐

I had this one-liner that I was using.

git commit -m "$(git diff --staged | claude -p "Provide a simple, one-line git commit message based on this diff following best practices. Output absolutely nothing else.")"

Pipe the staged diff to Claude, get a commit message back. Worked well until I hit my Claude usage limit mid commit. The shell captured the error instead of a commit message.

So I had a commit in my repo that said:

You've hit your session limit

That's when it hit me! Voila, ✨My use case for a Local Model.✨

⚠️ Disclaimer ⚠️

I am learning GenAI, this is my journey
This is not a tutorial
What is obvious to you might not be obvious to me

Getting Ollama running

Ollama lets you run open source models locally. After installing it, you have a server running at http://localhost:11434.

ollama pull qwen2.5-coder:1.5b

I picked qwen2.5-coder:1.5b because it's small and code-aware.

Why 1.5b specifically? My laptop has 8GB RAM. That's not a lot when you're running a model locally.

Here's the rough math (these are estimates from my machine, yours may vary):

Total Mac RAM: 8.0 GB
macOS + apps already running: ~4.0 to 5.0 GB
Model loaded in memory: ~1.2 GB (based on the model file size of ~1 GB)
Context window: ~0.03 GB
Remaining: ~1.77 to 2.77 GB free

Interestingly, despite being a 1.5 billion parameter model, qwen2.5-coder:1.5b only takes up about 1 GB of disk space. That's because it's a quantized model.

Quantization means the model's weights are stored at lower precision, using 4-bit or 8-bit integers instead of the usual 16-bit or 32-bit floating point numbers. This significantly reduces the model size and memory footprint, although it may slightly impact accuracy.
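The size arithmetic behind quantization can be sketched as follows. The 1.5e9 figure is the nominal parameter count taken from the model name, and the helper is hypothetical. Real quantized files usually mix precisions and carry metadata, so treat the results as rough estimates only.

```python
def estimate_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 1.5e9  # nominal parameter count, assumed from the "1.5b" tag

fp16_gb = estimate_weight_size_gb(params, 16)  # 16-bit floats
int8_gb = estimate_weight_size_gb(params, 8)   # 8-bit integers
int4_gb = estimate_weight_size_gb(params, 4)   # 4-bit integers
```

The roughly 1 GB file size mentioned above falls between the 4-bit and 8-bit estimates, which is consistent with a mostly low-bit quantized model.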

I tried larger models. My laptop became unusable. Fans spinning, apps freezing, the whole thing. So 1.5b it is.

There's another quantized model I found that could work — gemma3:1b-it-qat. I plan to test it sometime and see how it compares in terms of performance and resource usage.

First attempt

I swapped Claude with Ollama in my one-liner:

git commit -m "$(git diff --staged | ollama run qwen2.5-coder:1.5b "Provide a simple, one-line git commit message based on this diff following best practices. Output absolutely nothing else.")"

I ran it against a change where I had removed the tools section from the agent config front matter in 6 files. The command itself worked.

The commit message said it was a change to a README file.

🤔 What does this mean? 🤔

Despite qwen2.5-coder:1.5b's large native context window of 32,768 tokens, Ollama actually restricts the default context size when running without a Modelfile.

I checked Ollama's logs and found this line:

level=INFO source=routes.go:2073 msg="vram-based default context" total_vram="5.3 GiB" default_num_ctx=4096

It shows that based on my machine's VRAM of 5.3 GiB, Ollama set a default num_ctx of 4096 tokens. That's why the model only saw the beginning of the diff and guessed about the README file.
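Since the model does not report truncated input, a rough pre-check before sending a diff can help. This is only a sketch: the 4-characters-per-token ratio and the 256-token output reserve are assumptions, not values from Ollama, and real tokenization will differ.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the ratio is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, num_ctx: int,
                    reserve_for_output: int = 256,
                    chars_per_token: float = 4.0) -> bool:
    """True if the estimated prompt plus an output reserve fits in num_ctx."""
    return estimate_tokens(text, chars_per_token) + reserve_for_output <= num_ctx


# Hypothetical 20,000-character diff.
diff_text = "x" * 20000

fits_default = fits_in_context(diff_text, 4096)  # the default seen in the log
fits_bigger = fits_in_context(diff_text, 8192)   # a larger window
```

A check like this will not be exact, but it flags diffs that are clearly too large for the configured window.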

Second attempt

I thought maybe I need a better prompt. So I ran it again with more instructions.

This time it said the change was in code-reviewer.md. That was one of the 6 files, and it completely ignored the other 5.

The important thing here is that the model did not complain. It did not say "I couldn't read the rest". It just gave me a confident answer based on partial input.

At this point I understood tuning the prompt alone is insufficient and I need to tune the model too.

Creating a Modelfile

This is something I just learned. A Modelfile is a config layer on top of a base model. You can change parameters and create a named model from it.

FROM qwen2.5-coder:1.5b

PARAMETER num_ctx 8192 
PARAMETER temperature 0.2

Two things I changed:

num_ctx 8192: While qwen2.5-coder:1.5b can handle up to 32k tokens natively, Ollama defaults to a smaller context window when run without a Modelfile (in my case, 4096 based on VRAM). I bumped it to 8k so the model sees more of the diff while staying memory-efficient on my 8GB machine.

temperature 0.2 — lower temperature for more predictable output. For commit messages I don't want creative, I want consistent.

ollama create qwen-commit -f ./Modelfile

Now I have a model called qwen-commit that I can use for this specific task.

By the way, a Modelfile is not the only way to set these. You can use the REST API directly, and pass an options object:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:1.5b",
  "prompt": "${YOUR_PROMPT}",
  "options": {
    "temperature": 0.2,
    "num_ctx": 8192
  }
}'

For my use case the Modelfile made more sense because I just want to call ollama run qwen-commit and have everything pre-configured.
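The same REST call can be sketched in Python using only the standard library. The function names are my own, and "stream": False is an addition so the reply arrives as a single JSON object; the endpoint and options mirror the curl example above.

```python
import json
import urllib.request


def build_generate_request(prompt: str,
                           model: str = "qwen2.5-coder:1.5b",
                           temperature: float = 0.2,
                           num_ctx: int = 8192) -> urllib.request.Request:
    """Build a POST request for /api/generate with per-call options."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # assumption: ask for one JSON reply, not a stream
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(prompt: str) -> str:
    """Send the request to a running Ollama server and return the text reply."""
    with urllib.request.urlopen(build_generate_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Building the request needs no server; calling generate() does.
req = build_generate_request("Write a one-line commit message for this diff.")
```

Passing options per request is handy for experiments, while the Modelfile keeps the settings attached to a named model.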

Third attempt

With the bigger context window, the model could now see all 6 files. But it still described the change as "⠙ ⠹ ⠸ ⠼ ⠴ ⠦ ⠧ ⠇ ⠏ ⠋

`diff feat(.opencode/agent): update tool list for code-reviewer, frontend-enginee frontend-engineer, go-backend-engineer, project-lead, req requirements-analyst, solution-architect". Better, but a mouthful and still not usable.

🤔 What does this mean? 🤔

The model was now reading the full diff, and the commit message was technically correct, but it was nothing like what we would write in a commit message. Look at how it stuttered: frontend-enginee frontend-engineer, or req requirements-analyst.

So I changed the prompt. Instead of making the model figure it out, I just told it.


affected_files=$(git diff --staged --name-only | paste -sd, -)

Then added to the prompt: "Note that the changes are located in these files: [$affected_files]"

After this the commit messages got much better. The model didn't have to guess anymore.

One more thing

The commit messages were now accurate but the model kept wrapping them in weird formatting despite the prompt saying not to. Sometimes backticks. Sometimes it prefixed with "diff". Sometimes random quotes around the message.

So I added a cleanup step to strip all of that out:


msg=$(echo "$msg" | tr -d '\r' | sed -E \
  -e 's/```(diff)?//g' \
  -e 's/^diff[[:space:]]+//I' \
  -e 's/^[[:space:]]+//;s/[[:space:]]+$//' \
  -e 's/^["'\'']//' -e 's/["'\'']$//')

Not elegant but it catches most of the junk the model adds. Till the time I tune the prompt and model this stays!
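For comparison, the same cleanup can be sketched in Python. This is an approximation of the sed pipeline, not a line-for-line port: it works on the whole string instead of line by line, and the function name is my own.

```python
import re


def clean_commit_message(raw: str) -> str:
    """Strip the wrapping junk a small model tends to add around a commit message."""
    msg = raw.replace("\r", "")                               # drop carriage returns
    msg = re.sub(r"```(diff)?", "", msg)                      # drop code fences
    msg = msg.strip()                                         # trim outer whitespace
    msg = re.sub(r"^diff\s+", "", msg, flags=re.IGNORECASE)   # drop a leading "diff"
    msg = msg.strip()
    msg = re.sub(r"^[\"']", "", msg)                          # drop one leading quote
    msg = re.sub(r"[\"']$", "", msg)                          # drop one trailing quote
    return msg


# Hypothetical raw outputs of the kinds described above.
fenced = clean_commit_message("```diff\nfix: update tool list\n```")
quoted = clean_commit_message('"fix: update tool list"')
prefixed = clean_commit_message("diff feat: add config\r\n")
```

Like the shell version, this only handles the wrappers seen so far and will need new rules as the model invents new ones.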

I also switched from git diff --staged to git diff --staged --unified=0. By default, git shows 3 lines of context around each change. For a commit message, the model doesn't need that surrounding context. It just needs to know what changed. --unified=0 strips all that out, which means fewer tokens sent to the model. On a small context window, every token counts.

Tada 🎉

* b6f0abc (HEAD -> main, origin/main, origin/HEAD) fix: update tool list for all agents

On a much bigger, code-related change, you can see the gradual improvements across commits:

* b13f344 (HEAD -> main) fix(inspection-workflow): add requirement for editing confirmed vess vessel profile
* 958053c sh fix(app_test.go, sqlite.go, sqlite_test.go, tasks.md): add save and cancel  behaviour tests for vessel profile editing
* 0f33259 sh fix: update vessel profile form and edit flow in App.svelte, add tests for  editing workflow, and improve styles in styles.css, update model in go/mode go/models.ts

The final Modelfile

After all the iterations, my Modelfile looks quite different from where I started:

FROM qwen2.5-coder:1.5b

PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER top_p 0.7
PARAMETER num_predict 256
PARAMETER repeat_penalty 1.2
PARAMETER stop "Changes to be committed:"
PARAMETER stop "Note:"
SYSTEM """
You are an expert developer's assistant. Your sole task is to generate a clean, concise one-line Git commit message based on the provided code diff.
Rules:
- Respond ONLY with the commit message text.
- Do NOT include markdown code blocks, backticks, explanations, intro text, or outro text.
- Use the Conventional Commits format (e.g., feat(scope): message, fix: message).
- Keep the one line under 100 characters.
- Use the imperative mood ("Add feature", not "Added feature" or "Adds feature").
"""

What each parameter does and why I added it:

temperature 0.2: controls randomness. Lower means more predictable. I don't want creative commit messages.

top_p 0.7: works with temperature. It limits the model to the smallest set of most likely next tokens whose combined probability reaches 70%. Another way to keep the output focused and not wander off.

num_predict 256: maximum number of tokens the model can output. A commit message is one line. I don't need the model writing an essay. This caps it.

repeat_penalty 1.2: penalizes the model for repeating itself. Without this I was getting things like frontend-enginee frontend-engineer or req requirements-analyst. The model would stutter and repeat parts of words.

stop "Changes to be committed:" and stop "Note:" — stop sequences. Sometimes the model would keep going after the commit message and start generating text that looked like git output. These tell the model to stop immediately if it starts outputting these strings.

The SYSTEM block is the prompt baked into the model. Every time I run ollama run qwen-commit, this prompt is already there. I don't have to pass it every time.
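To make the top_p parameter above more concrete, here is a small sketch of nucleus (top-p) filtering in general. It illustrates the idea and is not Ollama's actual implementation; the token names and probabilities are invented.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the survivors."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Made-up next-token distribution for the first word of a commit message.
next_token_probs = {"fix": 0.5, "feat": 0.3, "docs": 0.15, "banana": 0.05}
filtered = top_p_filter(next_token_probs, 0.7)
```

With top_p at 0.7 only "fix" and "feat" survive, so unlikely tokens such as "banana" can never be sampled.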

The final function

After all the iterations, here is what I ended up with. A custom shell function gac and an alias gacc. It defaults to the local model, but I can also use Claude when I want to.

gac() {
  # 1. Check for staged changes
  if git diff --cached --quiet; then
    echo "❌ Error: No staged changes found. Run 'git add' first."
    return 1
  fi

  local mode="${1:-qwen}"
  local msg=""
  local exit_code=0

  # Gather file names for context
  local affected_files
  affected_files=$(git diff --staged --name-only | paste -sd, -)

  # ---------------------------------------------------------
  # IMPROVED PROMPT: Strict rules for Conventional Commits
  # ---------------------------------------------------------
  local system_prompt="You are a strict code assistant. Write a single-line Conventional Commit message for the provided diff.
Strict Rules:
1. Format must exactly match: type(scope): description
2. Allowed types ONLY: feat, fix, docs, style, refactor, perf, test, chore.
3. The 'scope' must be a single, broad feature/module name (e.g., vessel-profile, api). NEVER use file names.
4. The 'description' must summarize the high-level intent in the imperative mood (e.g., 'add form validation').
5. ABSOLUTELY DO NOT list specific file names, paths, or extensions in the commit message.
6. Output EXACTLY one line. No markdown blocks, no quotes, no explanations, and no stray prefixes like 'sh'.
Context: The files modified are [$affected_files]."

  # 2. Execution Routing
  if [ "$mode" = "claude" ]; then
    msg=$(git diff --staged --unified=0 | claude -p "$system_prompt" --output-format text 2>&1)
    exit_code=$?
  else
    if ! curl -s --max-time 2 http://localhost:11434 > /dev/null; then
      echo "❌ Error: Local Ollama server is not running on port 11434."
      return 1
    fi
    msg=$(git diff --staged --unified=0 | ollama run qwen-commit "$system_prompt" 2>/dev/null)
    exit_code=$?
  fi

  # 3. Robust Error Validation
  if [ $exit_code -ne 0 ] || [ -z "$msg" ]; then
    echo "❌ Error: Failed to generate a response via $mode."
    echo "Details received: $msg"
    return 1
  fi

  # 4. Strict Text Cleaning Pipeline
  msg=$(echo "$msg" | tr -d '\r' | sed -E -e 's/```(diff)?//g' -e 's/^[[:space:]]+//;s/[[:space:]]+$//' -e 's/^["'\'']//' -e 's/["'\'']$//')

  # 5. Run git commit cleanly
  git commit -m "$msg"
}

# Alias to explicitly force Claude
alias gacc="gac claude"

Lessons Learned

Tell the model what you already know. Don't make it guess things you can easily extract.
Low temperature for tasks where you want some determinism.
Modelfiles are useful. You can create a named model configured for a specific job.
Model size, (V)RAM, and context size are all connected. On a constrained machine, you have to be intentional about all three.

Is this perfect?

No. It still sometimes misses the point of a change. It takes time on larger commits. There is room for improvement.

Why not just use Claude directly? That's the easiest thing to do, but it still costs me tokens. And I wanted to learn how local models work. How context windows affect output. How to tune a model for a specific job. That was the whole point for me.

It works offline, costs nothing 💰, and I understand every piece because I broke it and fixed it.

I find the best way to learn is to find a real use case, however trivial. It helps you understand concepts one thing at a time.

Next up: My learnings building a green field product with OpenSpec meant for Brown field projects

I welcome all constructive feedback and comments

How to Run AI on Your Computer Without Spending Anything: A Complete Guide with Ollama (2026)

Hermes AI — Mon, 29 Jun 2026 13:20:13 +0000

Tags: ia, ollama, opensource, tutorial

Did you know you can run artificial intelligence models directly on your computer, with no subscription to pay, no dependence on the internet, and no need to send your data to third-party servers?

It sounds too good to be true, but in 2026 this is within reach of anyone with a mid-range laptop. Thanks to open source tools like Ollama, which has already passed 170 thousand stars on GitHub, you can have an AI running locally in under 10 minutes.

In this practical guide, I'll show you:

What Ollama is and why it became the standard
How to install it on Windows, macOS, and Linux
Which models run on each type of hardware
How to use local AI day to day (terminal, API, VS Code)
Tips for choosing the right model for your machine

Why run AI locally?

Before diving into the step-by-step, it is worth understanding why more and more people are adopting local AI:

🔒 Total privacy. Your data never leaves your machine. This is crucial for anyone working with confidential documents, proprietary code, or personal information.

💰 Zero cost. No monthly subscription. After the initial model download, you can use it as often as you like, with no token limit.

🌐 Works offline. No internet? No problem. You can use AI while traveling, in remote areas, or during connection outages.

⚡ Consistent speed. No waiting queue, no request limits, no dependence on overloaded servers.

🛠️ Total customization. You choose the model, tune the parameters, and create fine-tunes: the control is yours.

What is Ollama?

Ollama is an open source tool that simplifies running language models (LLMs) locally. Think of it as a "package manager" for AIs: you download, run, and manage models with simple commands.

Before Ollama, running a local model meant dealing with complex dependencies, GPU configuration, format conversions, and sprawling scripts. Ollama removed all that complexity with a single command:

ollama run llama3.2

Done. In seconds, you are chatting with an AI running 100% on your machine.

Installation in 3 steps

Windows

Go to ollama.com/download
Download the .exe installer
Run it and follow the wizard

After installation, open Command Prompt or PowerShell and type:

ollama --version

If the version number appears, you are all set.

macOS

With Homebrew installed, it takes a single command:

brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

The script detects your distribution (Ubuntu, Fedora, Arch, etc.) and handles everything automatically.

Your first model

Let's run the lightest and fastest model to get started:

ollama run llama3.2:1b

This is Llama 3.2 1B, from Meta. It has just 1 billion parameters and runs on any computer with 8 GB of RAM, with no dedicated graphics card.

The download happens automatically on first run (about 700 MB). On slower machines, it can take a few minutes.

Then just type your questions:

>>> What is a neural network?
A neural network is a computational model inspired by the human brain...

To exit, type /bye or press Ctrl+D.

Which models to choose (guide by hardware)

The big secret of local AI is choosing the right model for your machine. Here is a practical guide based on 2026:

🖥️ Basic laptop (8 GB RAM, no GPU)

Model	Parameters	Size	Ideal use
Llama 3.2	1B / 3B	~700 MB / ~2 GB	Simple chat, basic questions
Gemma 3	1B / 4B	~800 MB / ~2.5 GB	Short answers, summaries
Phi-3.5 Mini	3.8B	~2.4 GB	Code, logic

ollama run llama3.2:1b

💻 Mid-range laptop (16 GB RAM, no GPU)

Model	Parameters	Size	Ideal use
Llama 3.2	3B	~2 GB	Chat, creative writing
Mistral	7B	~4.1 GB	Deeper conversations
Qwen 2.5	7B	~4.4 GB	Code and reasoning
DeepSeek Coder V2 Lite	16B (IQ)	~6 GB	Code generation

ollama run mistral

🚀 Desktop with GPU (16 GB+ VRAM)

Model	Parameters	VRAM	Ideal use
Llama 4 Scout	17B	~10 GB	Everything: chat, code, analysis
Qwen 3	14B	~9 GB	Excellent in Portuguese
DeepSeek V3 Lite	16B	~9 GB	Advanced reasoning
Gemma 4	9B	~6 GB	Huge context (128K tokens)

ollama run llama4-scout

🏢 Workstation (24 GB+ VRAM)

Model	Parameters	VRAM	Ideal use
Qwen 3	32B	~18 GB	Full assistant
DeepSeek V3	67B	~40 GB	Local state of the art
Llama 4 Maverick	90B (quantized)	~48 GB	Maximum performance

Using local AI day to day

From the terminal

Ollama already works as a chat right in the terminal, but you can also ask one-off questions without entering interactive mode:

# Direct question
ollama run mistral "Explain what Docker is in one sentence"

# With a pipe
cat arquivo.txt | ollama run llama3.2 "Summarize this text"

# Using a template
ollama run qwen3 "Translate to English: Como rodar IA localmente"

Via the REST API

The Ollama server exposes a local API at http://localhost:11434 for the models you have downloaded. This means you can integrate the AI into your own programs:

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a poem about programming",
  "stream": false
}'

In Python, the integration is even simpler:

import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen3",
    "prompt": "What is an API? Explain it as if I were 10 years old",
    "stream": False
})

print(response.json()["response"])

In VS Code

The most powerful combination of 2026 is Ollama + Cline (or Continue.dev):

Install the Continue or Cline extension in VS Code
Go to the settings and select "Ollama" as the provider
Choose your local model (e.g. qwen3 or llama4-scout)
Done! You now have autocomplete and AI chat running 100% offline inside the editor

This means you can generate code, refactor functions, write tests, and document projects without a single line of code leaving your computer. Perfect for anyone working with proprietary code.

Essential Ollama commands

# List downloaded models
ollama list

# Download a model without running it
ollama pull llama4-scout

# Remove a model
ollama rm modelo-antigo

# See running models
ollama ps

# Create a custom model (Modelfile)
ollama create meu-modelo --file Modelfile

# Update Ollama
# Linux:
curl -fsSL https://ollama.com/install.sh | sh
# macOS:
brew upgrade ollama

Modelfile: creating your own model

You can customize the behavior of any model with a Modelfile:

FROM mistral

# Sets the personality
SYSTEM "You are an assistant specialized in Brazilian law. Always cite articles of law in your answers when possible."

# Adjusts temperature (0 = deterministic, 1 = creative)
PARAMETER temperature 0.3

ollama create direito-br --file Modelfile
ollama run direito-br

Tips for getting the most out of it

Less is more. Start with small models (1B-3B). They are fast and sufficient for 80% of everyday tasks.
Context matters. Local models have a context limit (typically 8K to 32K tokens). For long texts, split them into parts or use larger models such as Gemma 4 (128K).
A GPU speeds things up, but it is not required. Models up to 7B run well on CPU alone with 16 GB of RAM. The difference is that with a GPU, responses come out in seconds instead of minutes.
Update your models periodically. Better versions appear every month. ollama pull updates to the latest available version.
Combine tools. Ollama + Open WebUI gives your local models a ChatGPT-style interface. Ollama + AnythingLLM creates a complete local RAG (document search) setup.

Conclusion

Running AI locally has gone from an enthusiast pursuit to a practical, accessible tool. With Ollama, you install in minutes, choose among dozens of free models, and keep full control over your data.

It does not matter whether you have a basic laptop or a powerful workstation: there is a model that runs on your machine and meets your needs.

In 2026, with privacy becoming ever rarer in the digital world, having your own local AI is not just an interesting option: it is a step toward technological autonomy.

Try it yourself. Open the terminal and type:

ollama run llama3.2

In under 2 minutes you will have an AI talking to you, running 100% on your computer, without paying anything, without depending on the internet, and without sharing your data.

IA na Prática: technology you can use today.

Enjoyed the article? Leave your comments below and share which model you are running locally!

Stop Paying for Copilot: Run Local LLMs in VS Code & CLI (For Free)

Praveen Veera — Mon, 29 Jun 2026 13:03:02 +0000

Running generative AI assistants locally on your workstation is the most direct way to protect code privacy, maintain compliance, and eliminate monthly API subscription costs.

However, moving off the cloud is not as simple as installing an extension. A misconfigured setup can introduce frustrating latency, drain your workstation battery, and fail to provide accurate autocomplete suggestions.

This guide provides a conceptual overview of the local AI landscape followed by an actionable five-step guide to move your setup from the cloud to a fully local workstation.

1. Local vs. Cloud: Engineering Tradeoffs

Choosing a local setup is not a pure upgrade; it involves a series of engineering tradeoffs. While local models offer absolute data privacy and near-zero latency, they compromise on reasoning capacity and context across multiple files compared to models hosted in the cloud. Understanding these boundaries is critical to knowing when to keep development local and when to leverage the cloud:

Dimension	Local Assistant (e.g., Qwen 14B / Gemma 12B)	Cloud Assistant (e.g., Claude 3.5 Sonnet / GPT-4o)
Data Privacy	100% Private (No data leaves your workstation)	Subject to compliance review (Data sent to third party servers)
Token Cost	$0 / month (Runs entirely on local electricity)	$10–$20/mo subscription or fees based on token usage
Autocomplete Latency	~150ms (Instant, zero network delay)	~500ms - 1.2s (Depends on network stability and cloud congestion)
Offline Capability	Yes (Works on planes, trains, or secure offline VPCs)	No (Crashes instantly without active internet connection)
Cognitive Ceiling	Low to Medium (Struggles with reasoning across multiple files)	High (Resolves complex logic across different modules)

Where Local Models Fail

The Abstract Ceiling: A 14B model lacks the neural density to construct deep mental abstractions of complex codebases. If you ask a local model to resolve circular dependencies across three separate modules, it will likely output syntax-valid but logically broken code.
Rare Libraries & Edge Cases: Cloud models are pre-trained on terabytes of code, including obscure libraries and legacy documentation. Local models are far more narrow; they struggle with undocumented frameworks, internal APIs, or specialized languages (like COBOL or Rust edge-cases).
Multi-Modal Limitations: Local setups cannot parse wireframes or UI mockups to generate front-end CSS layouts on consumer GPUs without immediately triggering out-of-memory (OOM) errors.

The Local Model Landscape

Qwen2.5-Coder (The Gold Standard): Google-rivaling coding performance. It is optimized specifically for Fill-in-the-Middle autocomplete tasks, making it the most fluent local coding weight available today.
DeepSeek-Coder (The Alternative): Highly optimized for Python and C++ structures. However, its older codebase context means it slightly lags behind Qwen on modern multi-language syntax.
Gemma 4 QAT (The Logic Specialist): Excellent logic capabilities and a robust 32k context capability, though it requires custom parameter configuration in Ollama to run smoothly.

2. The Systems Metrics That Matter

When running local models, developer experience is governed by three primary systems metrics:

Time to First Token (TTFT) / Context Pre-fill Latency: The delay (in milliseconds) between triggering an autocomplete request and the model generating its first character. In autocomplete, a TTFT above 250ms breaks your visual typing flow.
Token Generation Throughput (Tokens/Second): The speed at which the model streams its output text once it starts writing. For real-time reading, you need at least 20–30 tokens/second. For autocomplete, the model should complete lines instantly (75+ tokens/second).
VRAM Footprint vs. System Memory Swap: If a model fits 100% inside VRAM, it runs at full speed. If it overflows by even 10MB, the OS pages the remaining weights to system RAM, creating a massive memory bus bottleneck. This drops speeds from 30 tokens/sec to under 2 tokens/sec. Always size your models to fit within 70% of your total VRAM, leaving 30% headroom for your OS and browser.
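The 70% sizing rule above can be written as a one-line check. This is a minimal sketch: the helper name is my own, and the footprint and VRAM figures in the examples are illustrative inputs, not measurements.

```python
def fits_in_vram(model_footprint_gb: float,
                 total_vram_gb: float,
                 budget_fraction: float = 0.70) -> bool:
    """True if the model fits within the given fraction of total VRAM."""
    return model_footprint_gb <= total_vram_gb * budget_fraction


# Illustrative cases: (footprint in GB, total VRAM in GB).
small_on_16 = fits_in_vram(4.7, 16)   # budget is 11.2 GB
large_on_16 = fits_in_vram(15.1, 16)  # exceeds the 11.2 GB budget
mid_on_8 = fits_in_vram(9.3, 8)       # exceeds the 5.6 GB budget
```

The default of 0.70 mirrors the rule of thumb above; adjust it if your OS and other apps need more or less headroom.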

🚀 The Local AI Developer Journey

  ├── Step 1: Audit Your Hardware (VRAM Sizing)
  ├── Step 2: Spin Up the Model Runner (Ollama)
  ├── Step 3: Link the IDE Interface (Continue config.json)
  ├── Step 4: Protect Workspace CPU (.continueignore)
  └── Step 5: Expand to the Command Line (CLI Pipes)

Step 1: Audit Your Hardware (The "Kitchen Counter" Rule)

Running models locally requires matching model parameters to your system's memory (VRAM/RAM).

💡 The Kitchen Counter Analogy: Think of VRAM (GPU memory) as your kitchen counter, and system RAM/swap as the pantry down the hall. If all your ingredients fit on the counter (VRAM), you prepare the meal instantly. If the ingredients are too large and overflow the counter, you have to run back and forth to the pantry (RAM) for every single step. Your cooking speed collapses. Keep your models strictly within VRAM bounds.

Here is your hardware compatibility reference sheet:

System VRAM (Kitchen Counter)	Model Parameter Size	Recommended Models	Quantization	VRAM Footprint
8 GB	1B - 3B	`qwen2.5-coder:1.5b`	`Q4_K_M`	~1.6 GB
16 GB	7B - 8B	`qwen2.5-coder:7b`	`Q4_K_M`	~4.7 GB
24 GB	12B - 14B	`qwen2.5-coder:14b`	`Q4_K_M`	~9.3 GB
32 GB+	14B - 22B	`codestral:22b`	`Q4_K_M`	~15.1 GB

Sizing Models to Task Complexity

To optimize compute resources, structure your workflow by mapping developer tasks to model sizes:

Simple Tasks (Tab Autocomplete & Syntax Matching): Single-line completions, closing parentheses, standard imports, variable assignments. Requires < 200ms latency. Sized at 1.5B to 3B parameters (e.g., Qwen2.5-Coder-1.5B-Base).
Medium Tasks (Context-Aware Chat & Unit Testing): Writing utility functions, refactoring single files, generating test suites, explaining compilation errors. Sized at 7B to 14B parameters (e.g., Qwen2.5-Coder-14B-Instruct or Gemma-4-12B).
Complex Tasks (Multi-File Debugging & System Architecture): Architectural planning, debugging cross-module dependencies, codebase index search. Sized at 22B+ parameters (e.g., Codestral-22B or private VPC-hosted 70B+ models).

Step 2: Spin Up the Model Runner (Ollama)

Ollama acts as the engine room of your setup. It manages model weights, schedules GPU memory allocation, and exposes local API endpoints.

Download and install Ollama for macOS.

Pull the two models we need (one lightweight model optimized for tab autocomplete, and one larger model for reasoning in chat):

# Pull the lightweight autocomplete model (Base model)
ollama pull qwen2.5-coder:1.5b-base

# Pull the chat sidebar reasoning model (Instruct model)
ollama pull qwen2.5-coder:14b-instruct

(Optional) Tuning Parameters via a Custom Modelfile

If you need custom parameters, such as running Gemma 4 12B QAT with an expanded 32k context window:

Locate your local GGUF file directory and create a Modelfile:

FROM /path/to/local/gemma-4-12b-it-QAT.gguf
PARAMETER num_ctx 32768

Build the model in Ollama:

ollama create gemma4:12b-qat-32k -f Modelfile

Step 3: Link the IDE Interface (Continue config.json)

Now we connect VS Code to your local Ollama engine using the open-source Continue.dev extension.

Install the Continue extension in VS Code.
Open the Continue settings (config.json) and configure it to point to your local Ollama instance:

{
  "models": [
    {
      "title": "Ollama - Qwen 14B Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:14b",
      "apiBase": "http://localhost:11434"
    },
    {
      "title": "Ollama - Gemma 4 QAT",
      "provider": "ollama",
      "model": "gemma4:12b-qat-32k",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Ollama - Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b-base",
    "apiBase": "http://localhost:11434"
  }
}
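A tag in config.json that Ollama has not pulled is an easy way to end up with a "model not found" error. The sketch below is one way to catch that early. `models_in_config` and `missing_models` are hypothetical helper names, and the set of pulled tags is typed in by hand here; in practice you would copy it from `ollama list`.

```python
import json

def models_in_config(config: dict) -> list[str]:
    """Collect every Ollama model tag referenced in a Continue config dict."""
    tags = [m["model"] for m in config.get("models", [])
            if m.get("provider") == "ollama"]
    auto = config.get("tabAutocompleteModel")
    if auto and auto.get("provider") == "ollama":
        tags.append(auto["model"])
    return tags

def missing_models(config: dict, pulled: set[str]) -> list[str]:
    """Return the config tags that are not in the locally pulled set."""
    return [t for t in models_in_config(config) if t not in pulled]

config = json.loads("""
{
  "models": [
    {"title": "Ollama - Qwen 14B Coder", "provider": "ollama", "model": "qwen2.5-coder:14b"},
    {"title": "Ollama - Gemma 4 QAT", "provider": "ollama", "model": "gemma4:12b-qat-32k"}
  ],
  "tabAutocompleteModel": {"title": "Ollama - Autocomplete", "provider": "ollama",
                           "model": "qwen2.5-coder:1.5b-base"}
}
""")

# Assumed result of `ollama list` for this example (typed by hand, not queried).
pulled = {"qwen2.5-coder:14b", "qwen2.5-coder:1.5b-base"}
print(missing_models(config, pulled))  # -> ['gemma4:12b-qat-32k']
```

Here the custom Gemma build shows up as missing because the optional Modelfile step was skipped in the assumed `pulled` set.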

Enabling the VS Code CLI Command

To open your configuration file directly from your terminal, enable the VS Code shell utility:

Open VS Code, open the Command Palette (Cmd+Shift+P on macOS, Ctrl+Shift+P on Windows/Linux).
Run: Shell Command: Install 'code' command in PATH.
Now, you can open and edit your configuration file directly from your terminal:
```
code ~/.continue/config.json
```

Replacing Copilot Features 1-to-1

Once Continue is connected to your local model runner, here is how you trigger the models to replace Copilot's core capabilities:

Inline Autocomplete (Ghost Text): As you write code, the lightweight Qwen-1.5B-Base model streams single-line completions inline. Press Tab to accept.
In-Place Code Editing (Cmd+I / Ctrl+I): Select a block of code, press Cmd+I (macOS) or Ctrl+I (Windows/Linux), type your editing instruction (e.g. "Convert this loop to a list comprehension"), and press Enter. The model will edit the file inline.
Sidebar Chat & Context (Cmd+L / Ctrl+L): Press Cmd+L to open the chat panel. Type @ to reference specific files, terminal shell commands, or your entire codebase index, routing the queries to your larger Qwen-14B-Instruct model.

ℹ️ Isolate Autocomplete from Chat: Do not route both chat and autocomplete to the same model. Tab autocomplete requires immediate responses. Use Qwen-1.5B-Base for autocomplete (optimized for fast, inline Fill-in-the-Middle tasks) and Qwen-14B-Instruct for the chat sidebar.

Workstation Benchmark Results (Measured Live on Apple M5 Pro)

To check local viability, we measured prompt pre-fill throughput (which determines Time to First Token) and token generation throughput (text output speed) on the Apple M5 Pro workstation:

Model Configuration	Parameter Size	VRAM Footprint	Quantization	Context Pre-fill Speed	Token Generation Speed	Latency / Use Case
Qwen2.5-Coder (Base)	1.5B	1.6 GB	`Q4_K_M`	190.6 tok/s	188.4 tok/s	< 80ms (Real-time autocomplete)
Gemma 4 QAT	12B	7.0 GB	`Q4_K_M`	129.5 tok/s	34.8 tok/s	Real-time reasoning
Qwen2.5-Coder (Instruct)	14B	9.0 GB	`Q4_K_M`	214.8 tok/s	30.0 tok/s	Cloud-parity chat speed

Benchmark Test Script & Code Reference

The benchmark tests were executed locally using the companion test script. The full source code is hosted in the companion repository:
👉 software-permanence/01-local-llm-vscode

Here is the raw terminal log output of running test_local_llm.py against Ollama:

=== Running Local LLM Workstation Benchmark ===
Target model: qwen2.5-coder:14b (Q4_K_M)

[Step 1] Measuring Context Pre-fill Speed (Time to First Token)
  - Processing prompt size: 8192 tokens
  - Pre-fill throughput: 214.8 tokens/second

[Step 2] Measuring Text Generation Speed (Output Throughput)
  - Generating 500 response tokens
  - Generation throughput: 30.0 tokens/second

[Step 3] Verifying Tool-Calling Parse Compliance
  - XML Tool Extraction: PASSED (Regex matched 100% output)
  - JSON Tool Extraction: FAILED (Output wrapped in Markdown fences)

=== Validation Complete: Qwen 14B behaves at cloud-parity speed ===
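Throughput figures are easier to judge as wall-clock time. As a rough sketch, the numbers in the log above translate into seconds as follows. `estimate_seconds` is a hypothetical helper, and the calculation assumes no model load time and no prompt caching.

```python
def estimate_seconds(prompt_tokens: int, prefill_tps: float,
                     output_tokens: int, gen_tps: float) -> tuple[float, float]:
    """Return (time to first token, generation time) in seconds."""
    ttft = prompt_tokens / prefill_tps   # time spent reading the prompt
    gen = output_tokens / gen_tps        # time spent writing the answer
    return ttft, gen

# Figures from the benchmark log: 8192-token prompt, 500 output tokens.
ttft, gen = estimate_seconds(8192, 214.8, 500, 30.0)
print(f"time to first token: {ttft:.1f}s, generation: {gen:.1f}s")
# -> time to first token: 38.1s, generation: 16.7s
```

The same function makes it easy to see how much a shorter prompt would cut the wait before the first token.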

Step 4: Protect Workspace CPU (.continueignore)

By default, Continue tries to index every file in your workspace to build local vector embeddings for chat retrieval. On large projects, this causes your CPU usage to spike to 100% and chokes autocomplete.

To prevent this, create a .continueignore file in the root of your project directory:

.git/
node_modules/
dist/
build/
.svelte-kit/
*.log

Fixing Context Shifting Latency

Autocomplete can freeze for 2-3 seconds when you switch tabs because Continue is parsing the entire contents of the new file.

The Fix: In VS Code settings, search for Continue: Tab Autocomplete Options, and set Prefix Length to 500 and Suffix Length to 250. Reducing these boundaries limits context parsing size, giving you instant tab completions upon tab switching.

Step 5: Expand to the Command Line (Terminal Agents & Pipes)

Once your local model runner is set up, you aren't restricted to the IDE. Ollama's CLI includes a launch command that spins up terminal coding agents against your local models.

⚠️ Beginner Warning (The Git Sandbox Rule): Terminal-native agents (opencode, claude) execute edits and run commands directly on your local system. Before launching an agent from your CLI, always ensure you are running it inside a clean Git repository with everything committed. If the agent runs a destructive command or writes broken code, you can roll back tracked files via git reset --hard and delete any new untracked files it created with git clean -fd.

1. Launching Terminal-Native Coding Agents

Instead of paid cloud services, you can run autonomous command-line developers directly inside your shell:

OpenCode (Anomaly's open-source coding agent): An autonomous terminal coder that reads build logs, refactors files, and handles tasks locally:
```
ollama launch opencode
```
Copilot CLI (Terminal helper agent): Explains shell commands, generates commands from natural language, and handles prompt operations in your terminal:
```
ollama launch copilot-cli
```
Claude Code (Agentic coding CLI): Anthropic's terminal coding agent, launched here against your local Ollama models:
```
ollama launch claude
```

2. Piping Logs for Custom Debugging

For quick troubleshooting, you can pipe compiler errors or log dumps directly into the model without copying and pasting:

# Pipe an execution error log to Ollama
cat error.log | ollama run qwen2.5-coder:14b "Explain this error and suggest a fix"

Direct Programmatic API Access

You can call your local models directly inside your applications or custom tooling. Here is how to execute a generation request using Curl and Python:

Using Curl:

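# Build the JSON with jq so build.sh is embedded and escaped; $(...) is not expanded inside single quotes
jq -n --rawfile script build.sh '{
  model: "qwen2.5-coder:14b",
  prompt: ("Convert this bash script to a Python script:\n" + $script),
  stream: false
}' | curl -s -X POST http://localhost:11434/api/generate -d @- | jq -r '.response'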

Using Python:

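import json
import urllib.request

# Read the script so the model actually receives it in the prompt.
with open("build.sh", encoding="utf-8") as f:
    script = f.read()

payload = {
    "model": "qwen2.5-coder:14b",
    "prompt": "Convert this bash script to a Python script:\n" + script,
    "stream": False,  # return one JSON object instead of a stream of chunks
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Send the request and print the generated text.
with urllib.request.urlopen(req) as response:
    response_data = json.loads(response.read().decode("utf-8"))
    print(response_data.get("response"))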

Pro-Tips & Troubleshooting

Issue: Port 11434 is Already in Use

On macOS, Ollama runs as a background service and will block port 11434 even if the app UI is closed.

The fix: Manually kill the background process via terminal:
```
pkill Ollama
```

Issue: Zero-Lag Loading (keep_alive)

By default, Ollama unloads models from memory after 5 minutes of inactivity. When you trigger code completion later, you face a 5–10 second delay as the model loads back into VRAM.

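The fix: Keep the model loaded by setting keep_alive to -1 (stay in memory indefinitely) or to a duration such as 30m (30 minutes). You can pass keep_alive in the body of an /api/generate or /api/chat request, or set the OLLAMA_KEEP_ALIVE environment variable for the Ollama server.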
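As a minimal sketch of the request-level option, the helper below builds an /api/generate request with keep_alive set. `build_generate_request` is a hypothetical name, and the request is only constructed here, not sent, so the snippet runs without a live Ollama server.

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str, keep_alive="30m",
                           host: str = "http://localhost:11434"):
    """Build (but do not send) a /api/generate request that sets keep_alive."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,  # -1 keeps the model loaded; "30m" = 30 minutes
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("qwen2.5-coder:1.5b-base", "def add(a, b):", keep_alive=-1)
print(json.loads(req.data)["keep_alive"])  # -> -1
# To actually send it: urllib.request.urlopen(req)
```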

🔰 Beginner's Troubleshooting Checklist

If your local development setup is failing, use this diagnostic guide to find the cause:

Is Ollama running? Open your terminal and run ollama list. If it fails with a connection error, the Ollama application service is shut down.
Is autocomplete lagging? If suggestions take more than 2-3 seconds, check if your model is spilling into system RAM. In Activity Monitor (macOS) or Task Manager (Windows), look at memory swap. If swap is active, you are running a model too large for your VRAM.
Is Continue forgetting instructions? If the sidebar chat stops responding or behaves erratically, you have probably hit the context limit of the loaded model. Start a new chat session to clear the active history window.

Summary

Running local models provides code privacy and offline capabilities. By combining Ollama, LM Studio, and Continue, you can configure a usable local developer environment in both your IDE and terminal.

What models are you running locally for autocomplete? Let me know in the comments.

Hi, I'm Praveen Veera. I build practical AI systems, specializing in Enterprise AI Platforms, Local LLMs, and Dev Tools.

GitHub (Companion Code): github.com/praveenveera/software-permanence

Ollama 'llama runner process has terminated'? Read the Exit Code, Then Fix It (2026)

Jovan Chan — Mon, 29 Jun 2026 07:06:02 +0000

This article was originally published on runaihome.com

TL;DR: Error: llama runner process has terminated means the backend that actually runs the model died before it could load. The fix depends entirely on the code after it: exit status 2 is usually a GPU/VRAM or driver-library mismatch, 0xc0000409 on Windows usually means the runner hit an instruction your CPU lacks (no AVX), and signal: killed on Linux is the kernel's OOM killer reclaiming system RAM. Read the code first; don't reinstall blindly.

What you'll be able to do after this guide:

Decode the four termination codes you'll actually see in 2026 and map each to a root cause
Pull the one line from the Ollama server log that tells you what really happened
Apply the specific fix — context size, GPU layers, quant, or driver — instead of guessing

Honest take: This error scares people because it looks like a crash deep in C++ land, but 90% of cases are one of three boring things: the model doesn't fit in memory, your CPU is too old for the prebuilt binary, or a GPU library got swapped under Ollama's feet. The exit code narrows it to one of those in about ten seconds. Find the code, then read the matching section below.

Step 1: Read the exit code (this is the whole diagnosis)

The full error always has the same shape:

$ ollama run llama3.1:8b
Error: llama runner process has terminated: exit status 2

That trailing token — exit status 2, exit status 0xc0000409, signal: killed, signal: aborted — is not noise. It's the operating system reporting how the runner subprocess died, and it points straight at the cause. Here's the map:

What you see	Platform	Almost always means	Jump to
`exit status 2`	Any	GPU library/driver mismatch, VRAM overflow, or bad GGUF	Cause A
`exit status 0xc0000409`	Windows	CPU lacks AVX/AVX2 (illegal instruction) or a GPU runtime fault	Cause B
`signal: killed`	Linux/Docker	Kernel OOM killer — system RAM exhausted	Cause C
`signal: aborted` / SIGABRT	Linux/Mac	Internal assertion failed (often a corrupt or unsupported model)	Cause D

These codes are stable across Ollama versions — they come from the OS, not Ollama. As of this writing the current release is Ollama v0.30.8 (June 12, 2026), and the behavior below was confirmed against the 0.30.x line. If you're more than a few versions behind, updating is a legitimate first move (see the bottom of Cause A) — but read your code first so you know what you're actually chasing.
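The lookup in the table above can be sketched as a small function, which is handy if you collect these errors from several machines. `classify` and `CAUSES` are hypothetical names; the mapping is copied from the table and nothing here talks to Ollama.

```python
CAUSES = {
    "exit status 2": "Cause A: GPU library/driver mismatch, VRAM overflow, or bad GGUF",
    "exit status 0xc0000409": "Cause B: CPU lacks AVX/AVX2 or a GPU runtime fault (Windows)",
    "signal: killed": "Cause C: kernel OOM killer (Linux/Docker)",
    "signal: aborted": "Cause D: internal assertion failed",
}

def classify(error_line: str) -> str:
    """Map the trailing token of the CLI error to the cause from the table."""
    marker = "llama runner process has terminated:"
    _, found, tail = error_line.partition(marker)
    if not found:
        return "not a runner-termination error"
    # Exact match on the trailing token, so "exit status 2" never shadows other codes.
    return CAUSES.get(tail.strip().lower(), "unknown code: check the server log")

print(classify("Error: llama runner process has terminated: exit status 2"))
# -> Cause A: GPU library/driver mismatch, VRAM overflow, or bad GGUF
```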

Step 2: Get the real reason from the server log

The one-line CLI error is a summary. The runner writes its actual death note to the server log before it dies. Find it:

Linux (systemd): journalctl -u ollama --no-pager | tail -n 50
macOS: tail -n 50 ~/.ollama/logs/server.log
Windows: open %LOCALAPPDATA%\Ollama\server.log (i.e. C:\Users\<you>\AppData\Local\Ollama\server.log)

Scroll to the lines just before the termination. You're hunting for one of these tells:

SIGILL: illegal instruction
CUDA error: out of memory
cudaMalloc failed: out of memory
entering low vram mode
error loading model: unable to allocate backend buffer

Whichever line shows up confirms which cause below applies. Don't skip this step — it's the difference between a five-minute fix and an afternoon of reinstalling drivers you didn't need to touch.
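If you would rather not eyeball the log, the hunt for those tells can be sketched as a simple scan. `find_tells` is a hypothetical helper, and the sample log below is made up for illustration; it is not real Ollama output.

```python
TELLS = (
    "SIGILL: illegal instruction",
    "CUDA error: out of memory",
    "cudaMalloc failed: out of memory",
    "entering low vram mode",
    "error loading model: unable to allocate backend buffer",
)

def find_tells(log_text: str, last_n: int = 50) -> list[str]:
    """Return the lines among the last `last_n` that contain a known tell."""
    tail = log_text.splitlines()[-last_n:]
    return [line for line in tail
            if any(t.lower() in line.lower() for t in TELLS)]

# Made-up log excerpt for illustration only (not real Ollama output).
sample = "\n".join([
    'level=INFO msg="loading model"',
    "ggml_backend: cudaMalloc failed: out of memory",
    'level=ERROR msg="llama runner process has terminated"',
])
print(find_tells(sample))  # -> ['ggml_backend: cudaMalloc failed: out of memory']
```

To run it on a real log, read the file from the path for your platform and pass its text to `find_tells`.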

Cause A — `exit status 2`: VRAM, driver libraries, or a bad model

This is the catch-all crash, and it has three common flavors.

A1. The model doesn't fit (most common). If the log shows CUDA error: out of memory, cudaMalloc failed, or entering low vram mode right before the crash, the runner tried to allocate more VRAM than the card has and died. This is the same root cause covered in depth in our CUDA out of memory fix guide — the short version:

Shrink the context. The KV cache scales with context length and quietly dominates VRAM at long contexts. Cap it:

  # per-session
  $ OLLAMA_CONTEXT_LENGTH=4096 ollama serve
  # or in the systemd service: Environment="OLLAMA_CONTEXT_LENGTH=4096"

Drop to a smaller quant. A q4_K_M build of an 8B model needs ~6–7 GB; the q8_0 of the same model needs ~9 GB. If you're at the edge, the smaller quant is the cheapest win. (If you're unsure which quant to pick, see quantization explained.)
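Let some layers spill to CPU on purpose. Setting num_gpu to a value lower than the model's layer count offloads the rest to RAM: slower, but it loads instead of crashing. num_gpu is a model option rather than an ollama run flag, so pass it in the request's options (or set PARAMETER num_gpu 28 in a Modelfile):

  $ curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "hello", "options": {"num_gpu": 28}}'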
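To see why context length dominates VRAM, here is a back-of-the-envelope sketch of KV cache size. The model shape is an assumption (32 layers, 8 KV heads, head dimension 128, 16-bit cache values, which matches the published Llama 3.1 8B architecture), and `kv_cache_bytes` is a hypothetical helper. Real usage also depends on cache quantization and runtime overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Size of the K and V caches for one sequence (factor 2 = keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128, f16 cache.
for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.1f} GiB")
# ->   4096 tokens -> 0.5 GiB
# ->  32768 tokens -> 4.0 GiB
# -> 131072 tokens -> 16.0 GiB
```

Under these assumptions the cache grows linearly with context, so capping the context at 4096 frees several gigabytes compared with a 32k window.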

A2. A swapped GPU library (AMD/ROCm and custom builds). A frequently reported version of exit status 2 happens after someone manually replaces Ollama's bundled ROCm libraries to force support for an unsupported architecture — for example dropping gfx1031 files in to make a Radeon RX 6750 XT work. When the patched library and the runner disagree, the runner faults on load. If you've hand-edited anything under Ollama's lib/ directory, reinstall Ollama cleanly to restore the matched binaries, then let it auto-detect the GPU rather than forcing an architecture.

A3. A corrupt or partially downloaded model. If the crash is specific to one model and only after an interrupted pull or an offline copy, the GGUF blob may be truncated. Re-pull it:

$ ollama rm llama3.1:8b
$ ollama pull llama3.1:8b

If A1–A3 don't apply and you're several releases behind, update Ollama — GGUF/llama.cpp hardware support broadens with nearly every release, and v0.30.8 specifically expanded the set of cards and quant formats the runner accepts.

Cause B — `exit status 0xc0000409` on Windows: your CPU, not your GPU

0xc0000409 is the Windows NTSTATUS code STATUS_STACK_BUFFER_OVERRUN, which Windows also reports when a process terminates itself through the fast-fail mechanism; the dedicated illegal-instruction code is 0xC000001D. So despite how it reads, this is usually not a memory bug. In the reported Ollama cases the trigger is the CPU being asked to execute an instruction it doesn't have: the prebuilt Ollama runner uses AVX/AVX2 and your processor doesn't support it. This has been reported across model families (phi3, llama3.2) on older Intel and budget CPUs going back to Ollama 0.1.x, and the SIGILL line in the log is the confirmation.

What works:

Confirm the CPU is the issue. In the log, an illegal instruction / SIGILL line right before the exit confirms AVX is the culprit. You can also check your CPU's spec sheet for "AVX2" support.
Force a GPU load so the CPU path is never taken. If you have a supported NVIDIA/AMD GPU large enough for the model, make sure Ollama is actually using it (run ollama ps and look for 100% GPU). When the model runs entirely on the GPU, the AVX-dependent CPU kernels aren't exercised. If ollama ps shows a CPU/GPU split, you're back in CPU territory — shrink the model until it fits fully on the GPU. Our Ollama not using GPU guide walks through forcing GPU detection.
If there's no AVX and no usable GPU, that machine genuinely can't run the prebuilt binary. The honest answer is to run inference somewhere else — a different box, or a rented cloud GPU. For occasional jobs, RunPod is cheaper than buying a new CPU.

A second, rarer flavor of 0xc0000409 is a GPU runtime fault — a mismatched or corrupted CUDA/driver install rather than a CPU issue. If the log shows CUDA errors instead of SIGILL, update your NVIDIA driver and reinstall Ollama, the same way you'd treat Cause A2.

Cause C — `signal: killed` on Linux: the OOM killer got you

signal: killed is SIGKILL, and on Linux the usual sender is the kernel's out-of-memory (OOM) killer. When loading a model pushes total system RAM past the limit, the kernel picks a process and terminates it instantly — no cleanup, no error message from Ollama, the runner just vanishes. Confirm it:


$ dmesg | grep -i "killed process"
[ 4823.

Running a Whole RAG Agent Offline: LangGraph + Ollama + Embedded Qdrant (Zero API Keys)

duke — Mon, 29 Jun 2026 01:22:31 +0000

Most RAG tutorials open with "set your OPENAI_API_KEY." This one doesn't need it. In Part 1 I claimed the LLM and embeddings are behind a swappable boundary — "switch providers via config, not code." Part 3 is me cashing that claim: running the entire RAG agent — ingestion, retrieval, the ReAct loop, source citations — on a laptop with zero API keys and no Docker, just Ollama and an embedded Qdrant.

Everything below is real output from an actual run. Including the one thing that broke.

What "offline" actually requires

Three pieces, all local:

Ollama running two models — one for chat, one for embeddings:

  ollama pull qwen3.5:9b   # chat / reasoning
  ollama pull bge-m3       # embeddings (1024-dim, multilingual)

Embedded Qdrant — no server, no container. The vector store writes to a local directory.
A one-line config flip so chat goes to Ollama instead of the gateway:

  CHAT_PROVIDER=ollama

That's it. No OPENAI_API_KEY, no docker compose up. The reason this is a flip and not a rewrite is the provider-swap design from Part 1 — let's look at the three factories that make it work.

The embeddings factory — swap by config

# app/llm/embeddings.py
from functools import lru_cache

from langchain_core.embeddings import Embeddings

# get_settings() is the project's own settings loader; its import is not shown in the post.

@lru_cache
def get_embeddings() -> Embeddings:
    s = get_settings()
    provider = s.embedding_provider.lower()

    # Provider SDKs are imported lazily so only the selected one needs to be installed.
    if provider == "ollama":
        from langchain_ollama import OllamaEmbeddings
        return OllamaEmbeddings(model=s.embedding_model, base_url=s.ollama_url)

    if provider == "openai":
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings(base_url=f"{s.litellm_url}/v1",
                                api_key=s.litellm_key, model=s.embedding_model)

    raise ValueError(f"unknown embedding_provider: {s.embedding_provider!r}")

Both branches return the same LangChain Embeddings interface, so the ingestion and retrieval code never knows which one it got. Local dev → Ollama (offline). Production → OpenAI via the gateway. One caveat that matters later: the two providers produce different vector dimensions, so you can't mix vectors ingested with one and queried with the other. More on that in the gotchas.

The vector store — embedded vs. remote, also by config

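# app/rag/store.py
from functools import lru_cache

from qdrant_client import QdrantClient

@lru_cache
def get_client() -> QdrantClient:
    s = get_settings()  # project settings loader; its import is not shown in the post
    if s.qdrant_url:
        return QdrantClient(url=s.qdrant_url, api_key=s.qdrant_api_key)  # remote (prod)
    return QdrantClient(path=s.qdrant_path)                             # embedded (local)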

No QDRANT_URL? You get an embedded client that persists to s.qdrant_path — a plain directory. Set QDRANT_URL in prod and the same code talks to a real Qdrant service. The trade-off of embedded mode: it locks the directory to a single process, which becomes gotcha #2.

Ingestion: docs → chunks → vectors

The ingest script is the whole pipeline in ~30 lines: load files, split them, probe the embedding dimension, create the collection, upsert.

# scripts/ingest.py (trimmed)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(documents)

# probe the embedding dimension so the collection matches the provider
dim = len(get_embeddings().embed_query("probe"))
ensure_collection(dim)
get_vector_store().add_documents(chunks)

The embed_query("probe") trick is worth pausing on: instead of hard-coding 1024 for bge-m3 (or 1536 for OpenAI), it asks the active embedder for one vector and measures it. Swap the provider and the collection is created with the right size automatically.

Running it for real:

$ python scripts/ingest.py --reset
[ingest] source=docs  collection=docs  embed=ollama:bge-m3
[ingest] 5 documents → 53 chunks
[ingest] embedding dim = 1024
[ingest] done — 53 points in collection

Five markdown files, 53 chunks, 1024-dim vectors from bge-m3, written to the local Qdrant directory. No network calls left the machine.

Running the agent — no server needed

You can hit the FastAPI endpoint, but to see the graph think you can also invoke it directly. Here's a real run, asking about something that lives in the docs:

res = await graph.ainvoke({"messages": [HumanMessage(content=
    "How is short-term vs long-term memory implemented in this project?")]})

print([type(m).__name__ for m in res["messages"]])
# ['HumanMessage', 'AIMessage', 'ToolMessage', 'AIMessage']

That message sequence is the ReAct loop, visible in the state:

HumanMessage — the question
AIMessage with tool_calls=[search_docs(...)] — the model decides to retrieve
ToolMessage — the retrieved chunks come back
AIMessage — the final synthesized answer

And the answer itself, generated entirely by a 9B model on the laptop:

Short-term memory: PostgreSQL (PostgresSaver) stores per-thread
  conversation state; swappable to Redis (RedisSaver) if needed.
Long-term memory: Zep manages the user's persistent knowledge,
  recalled by the app on later turns.

Sources: <doc-a>.md, <doc-b>.md

Grounded in the actual docs, with source attribution, zero API keys. That's the win. Now the part the tutorials skip.

Gotchas (the part that's actually worth reading)

1. The empty synthesis turn — the local model, not the pipeline

On one run, the exact same question produced this:

[1] AIMessage   content=''   tool_calls=[search_docs(...)]   finish_reason='tool_calls'
[2] ToolMessage content='[1] (source: ...) ## memory layers ...'   ← retrieval worked
[3] AIMessage   content=''   tool_calls=[]   finish_reason='stop'  ← empty answer

Retrieval succeeded. The chunks were right there in step 2. But step 3 — the model's job to read the chunks and answer — came back empty. finish_reason='stop', no tokens, no error. Re-running the same question gave a perfectly good 280-character answer with citations. So it's intermittent: a small local model occasionally produces an empty turn after a tool call.

Two things to take away:

It's the model, not your graph. The pipeline (routing → retrieval → state) was flawless; the synthesis step just whiffed.
The saw_token fallback from Part 2 won't save you here — that fallback calls ainvoke when no tokens stream, but here ainvoke is the empty result. The real mitigations are a larger/better tool-tuned local model, or accepting some flakiness as the price of fully offline. Worth knowing before you demo it live.
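Since re-running the same question produced a good answer, one cheap stopgap is to retry when the final message comes back empty. This is only a sketch and not part of the project's code: `invoke_with_retry` is a hypothetical wrapper, `flaky_invoke` stands in for `graph.ainvoke`, and it assumes the final message's `content` is a plain string.

```python
import asyncio

async def invoke_with_retry(invoke, state, max_attempts: int = 3):
    """Call `invoke(state)` again if the final message has empty content."""
    result = None
    for _ in range(max_attempts):
        result = await invoke(state)
        final = result["messages"][-1]
        if getattr(final, "content", "").strip():
            return result
    return result  # still empty after max_attempts; the caller decides what to do

# Stand-in for graph.ainvoke (hypothetical): empty on the first call only.
class Msg:
    def __init__(self, content):
        self.content = content

calls = {"n": 0}

async def flaky_invoke(state):
    calls["n"] += 1
    return {"messages": [Msg("" if calls["n"] == 1 else "Short-term memory: ...")]}

res = asyncio.run(invoke_with_retry(flaky_invoke, {"messages": []}))
print(calls["n"], repr(res["messages"][-1].content))  # -> 2 'Short-term memory: ...'
```

A retry repeats the retrieval as well as the synthesis, so it costs extra time, and it does nothing for a model that whiffs consistently.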

2. Embedded Qdrant locks the directory

Embedded mode keeps the store in one process. Run the ingest script while the server is up and you'll get a lock error. Order matters: ingest first → let it exit → then start the server. The ingest script even closes the client explicitly to avoid a noisy shutdown traceback.

3. Embedding dimensions must match end to end

bge-m3 is 1024-dim; OpenAI's text-embedding-3-small is 1536. If you ingest with one provider and query with another, the dimensions don't line up and search breaks. Switching embedding_provider means re-ingesting (--reset). The embed_query("probe") dimension check is exactly what keeps the collection honest per provider.
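The same probe idea can double as a fail-fast guard at query time. In this sketch `check_dimension` is a hypothetical helper and the two embedders are stand-ins that only return vectors of the right length; a real setup would pass `get_embeddings().embed_query` and the collection's configured size.

```python
from typing import Callable, Sequence

def check_dimension(embed_query: Callable[[str], Sequence[float]],
                    collection_dim: int) -> int:
    """Probe the active embedder and fail fast if the collection size differs."""
    dim = len(embed_query("probe"))
    if dim != collection_dim:
        raise ValueError(
            f"embedder returns {dim}-dim vectors but the collection stores "
            f"{collection_dim}-dim vectors; re-ingest with --reset")
    return dim

# Stand-in embedders (hypothetical): real ones would call bge-m3 or OpenAI.
def fake_bge_m3(text: str) -> list[float]:
    return [0.0] * 1024

def fake_openai(text: str) -> list[float]:
    return [0.0] * 1536

print(check_dimension(fake_bge_m3, 1024))  # -> 1024
try:
    check_dimension(fake_openai, 1024)
except ValueError as err:
    print(err)
```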

4. The first call is slow

Ollama loads the model into memory on first use. The first request eats that cost; subsequent ones are fast. Don't benchmark the cold start.

Why this matters

You can build, debug, and demo the entire RAG agent — graph, retrieval, citations — on a plane with no wifi. Then, for production, you flip two config values (CHAT_PROVIDER, QDRANT_URL) and the same code talks to a hosted model and a real Qdrant cluster. Part 1 claimed the provider boundary; Part 3 ran on both sides of it.

The flip side is honesty about local models: retrieval is rock-solid, but a 9B model's synthesis step is the weak link, and it'll occasionally hand you an empty answer. Know that going in.

Next: persisting conversation threads with a checkpointer — so the agent remembers across requests — and what that adds to the message log you just saw.

Part 3 of a series on running LangGraph in production. Part 1 · Part 2.

My RAG Benchmark is lying to me

Dogukan Karademir — Sun, 28 Jun 2026 21:45:58 +0000

I built a benchmark to find the best local LLM for my RAG system. After some runs, I'm less confident in the results than when I started — and I think that's the more useful story.

Here's the specific problem that broke my assumptions.

The Setup

Kenning is a Spring Boot RAG backend: Spring AI, pgvector, Ollama, Apache Tika for PDF parsing. You upload a document, ask questions, get answers grounded only in that document.

I built a benchmark to test six local models: llama3.1:8b, llama3.2:3b, qwen2.5:7b, gemma2:9b, mistral:7b, phi4:14b.

Four question categories, judged blind by qwen2.5:14b:

IN_CONTEXT — answer is in the document
OUT_OF_CONTEXT — answer isn't; model must refuse
PARTIAL_CONTEXT — partial information; model must say what it found and what's missing
MULTI_CHUNK — answer spans multiple sections

Maximum 875 points per model at 35 questions.

First Problem: The Ceiling Effect

First run, 20 questions on Attention Is All You Need (the Transformer paper):

Model	Score
`qwen2.5:7b`	481/500 — 96.2%
`llama3.1:8b`	475/500 — 95.0%
`phi4:14b`	474/500 — 94.8%
`gemma2:9b`	473/500 — 94.6%
`llama3.2:3b`	466/500 — 93.2%
`mistral:7b`	463/500 — 92.6%

IN_CONTEXT category: every single model averaged 25/25. Perfect score.

This is what a useless benchmark looks like. Questions like "How many attention heads does the Transformer use?" are trivially easy if the retrieved chunk contains h = 8. I wasn't measuring model capability — I was measuring whether models can read.

I added 15 harder questions and rewrote the chunking.

The Rewrite That Changed Everything

The original code used TokenTextSplitter with default settings. I changed it to 200-token chunks with 100-token overlap between adjacent chunks:

TokenTextSplitter splitter = TokenTextSplitter.builder()
    .withChunkSize(200)
    .withKeepSeparator(true)
    .build();

List<Document> chunks = splitter.apply(documents);
List<Document> overlapped = overlapAppender.addOverlap(chunks, 100);

The idea: information lost at chunk boundaries (a sentence split across two chunks is fully represented in neither) would be preserved by overlapping.
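To make the overlap idea concrete, here is a minimal sliding-window sketch in Python. It is not the Spring AI code above: `chunk_with_overlap` is a hypothetical helper, and a list of placeholder strings stands in for real tokenizer output.

```python
def chunk_with_overlap(tokens: list[str], size: int = 200,
                       overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of `size` sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

tokens = [f"t{i}" for i in range(450)]  # stand-in for real tokenizer output
chunks = chunk_with_overlap(tokens)
print([(c[0], c[-1], len(c)) for c in chunks])
# -> [('t0', 't199', 200), ('t100', 't299', 200), ('t200', 't399', 200), ('t300', 't449', 150)]
```

With a 100-token overlap on 200-token chunks, every token except those at the very start and end of the document lands in two chunks, so a sentence cut at one boundary appears whole in the neighbouring window.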

New results on 35 questions, same document:

Model	Score
`phi4:14b`	839/875 — 95.9%
`qwen2.5:7b`	822/875 — 93.9%
`gemma2:9b`	818/875 — 93.5%
`llama3.1:8b`	815/875 — 93.1%
`mistral:7b`	780/875 — 89.1%
`llama3.2:3b`	771/875 — 88.1%

The ranking changed. phi4:14b, which was 3rd before, now leads. The spread grew from 3.6 to 7.8 percentage points.

Here's the Problem

I changed two things at the same time: the chunking strategy and the question difficulty. I can't isolate which change drove the ranking shift.

And I can prove the chunking changed what models actually saw.

Question q01: "How many attention heads does the base Transformer use?" — categorized as IN_CONTEXT because the answer (h = 8) is in the paper.

Original chunking: retrieved a chunk containing h = 8. Model answered correctly.

New chunking: retrieved chunks about multi-head attention applications. The specific h = 8 chunk was no longer in the top 5 by similarity score. phi4:14b correctly said: "The provided context does not specify the number of attention heads."

Judge score: 25/25. The model isn't lying — it answered correctly given what it received.

But the system failed the user. That question is answerable. The document has the answer. The retrieval missed it.

So here's what I was actually measuring: model behavior given what my chunking strategy retrieved — not model capability. The "model benchmark" was really a "chunking configuration benchmark." I just didn't realize it until the results changed.

The Second Document Made It Worse

I added a second document — NIST SP 800-63B, a US federal authentication standard. ~70 pages of SHALL/SHOULD requirements, distributed across sections and tables. Nothing like an academic paper.

Same question structure, same judge, same chunking.

Model	Transformer paper	NIST	Drop
`phi4:14b`	95.9%	90.9%	−5.0 pp
`mistral:7b`	89.1%	88.6%	−0.5 pp
`qwen2.5:7b`	93.9%	87.8%	−6.1 pp
`gemma2:9b`	93.5%	83.4%	−10.1 pp
`llama3.1:8b`	93.1%	83.2%	−9.9 pp
`llama3.2:3b`	88.1%	79.3%	−8.8 pp

mistral:7b went from 5th to 2nd. gemma2:9b dropped 10 percentage points and posted the worst category score in the entire dataset (17.1/25 average in PARTIAL_CONTEXT on NIST).
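The drops and rank changes can be recomputed from the table with a few lines. This is just the arithmetic on the published percentages; `ranking` is a hypothetical helper.

```python
scores = {  # model: (Transformer paper %, NIST %), copied from the table above
    "phi4:14b": (95.9, 90.9),
    "mistral:7b": (89.1, 88.6),
    "qwen2.5:7b": (93.9, 87.8),
    "gemma2:9b": (93.5, 83.4),
    "llama3.1:8b": (93.1, 83.2),
    "llama3.2:3b": (88.1, 79.3),
}

def ranking(column: int) -> list[str]:
    """Models sorted best-first on the given column (0 = paper, 1 = NIST)."""
    return sorted(scores, key=lambda m: scores[m][column], reverse=True)

paper_rank, nist_rank = ranking(0), ranking(1)
for model, (paper, nist) in scores.items():
    print(f"{model:<12} drop {nist - paper:+.1f} pp  "
          f"rank {paper_rank.index(model) + 1} -> {nist_rank.index(model) + 1}")
```

The loop prints one line per model, which makes the mistral:7b jump from 5th to 2nd easy to spot next to its 0.5-point drop.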

Now I have two explanations and no way to distinguish them:

Explanation A: These are real model differences. Some models handle technical regulatory text better than dense academic prose. mistral is more stable across document types; gemma2 is more brittle.

Explanation B: Chunking performance is strongly document-dependent, and published benchmarks suggest there is no single "best" strategy for everything.

Recent research highlights exactly how much the structure of a document dictates the winning pipeline. For instance, a February 2026 benchmark by Vecta evaluating 7 chunking strategies across 50 academic papers found that standard recursive 512-token splitting took 1st place with 69% accuracy. In that specific domain, semantic chunking tanked at 54% because it over-fragmented the text, producing tiny snippets averaging just 43 tokens that stripped away crucial context. For a standard academic paper, fixed-size or recursive chunking is often perfectly fine or even superior.

Conversely, when dealing with complex, non-linear layouts, fixed token limits completely collapse. A separate study evaluating structured/clinical documents found that adaptive, theme-boundary chunking reached 87% accuracy, while fixed-size baselines plummeted to a dismal 13%.

If Explanation B is right, it recontextualizes my results. My naive 200-token split with 100-token overlap happened to work reasonably well for the uniform, dense layout of the Transformer paper. On a 70-page regulatory standard like NIST, where a single requirement might be scattered across cross-referenced sections and multi-row tables, it likely cut the text at arbitrary points. Under that reading, models like gemma2 that are sensitive to context fragmentation fell off a cliff, while mistral was more resilient to the poorly sliced context.

The takeaway isn't that semantic chunking is a silver bullet—it's that a one-size-fits-all chunking pipeline is fundamentally broken. The experiment that would actually prove this — running the same models with multiple chunking configurations (fixed vs. semantic vs. structure-aware) on the exact same document — is the one I didn't do.

What I'd Actually Need to Know Which Model to Pick

Multiple chunking strategies per document type, held constant while varying models
Retrieval quality metrics separate from answer quality (MRR, Recall@5 — did the right chunk even make it into the top 5?)
Multiple judge models, not just one (my judge could have systematic biases I can't detect)
Real user questions from actual sessions, not questions I wrote after reading the document myself
Multiple runs per model to account for non-determinism

Without these, the ranking I have is a ranking of "this specific pipeline configuration" not "these models."
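For the retrieval metrics in that list, a minimal sketch of Recall@k and MRR looks like this. The chunk ids are hypothetical ("c_h8" stands for the chunk that contains h = 8), and the two runs mirror the q01 case: found under one chunking, missed under the other.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical chunk ids; "c_h8" stands for the chunk that contains h = 8.
runs = [
    (["c_h8", "c12", "c40", "c07", "c33"], {"c_h8"}),  # gold chunk ranked 1st
    (["c12", "c40", "c07", "c33", "c51"], {"c_h8"}),   # gold chunk missed
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
recall = sum(recall_at_k(r, rel) for r, rel in runs) / len(runs)
print(f"MRR={mrr:.2f}  Recall@5={recall:.2f}")  # -> MRR=0.50  Recall@5=0.50
```

Scored this way, the second run shows up as a retrieval failure even though the judge gave the model's refusal 25/25.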

The Honest Takeaway

I didn't build a production RAG app. I built an understanding of how much is hidden under "just do RAG."

The thing I expected to matter most — model choice — turned out to be inseparable from chunking strategy, retrieval configuration, and document structure. Changing chunk size doesn't change which model is capable of what. It changes what the model sees. And what the model sees determines everything.

If I had to tell someone one thing before they start benchmarking models for RAG: measure your retrieval quality first. If the right chunks aren't being retrieved, you're not benchmarking models — you're benchmarking whether your similarity search surfaces the right context. Those are very different problems.
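To make "measure your retrieval quality first" concrete, here is a minimal sketch of Recall@k and MRR. It assumes you have hand-labelled which chunk IDs contain the answer for each question; the chunk IDs and function names are illustrative, not taken from Kenning.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant chunks that show up in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results):
    """results: one (retrieved_ids, relevant_ids) pair per question."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in results) / len(results)
```

Both numbers are computed before any model sees the context, so a low score points at chunking or search rather than at the LLM.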

My RAG Benchmark is lying to me

Dogukan Karademir — Sun, 28 Jun 2026 21:45:58 +0000

I built a benchmark to find the best local LLM for my RAG system. After some runs, I'm less confident in the results than when I started — and I think that's the more useful story.

Here's the specific problem that broke my assumptions.

The Setup

Kenning is a Spring Boot RAG backend: Spring AI, pgvector, Ollama, Apache Tika for PDF parsing. You upload a document, ask questions, get answers grounded only in that document.

I built a benchmark to test six local models: llama3.1:8b, llama3.2:3b, qwen2.5:7b, gemma2:9b, mistral:7b, phi4:14b.

Four question categories, judged blind by qwen2.5:14b:

IN_CONTEXT — answer is in the document
OUT_OF_CONTEXT — answer isn't; model must refuse
PARTIAL_CONTEXT — partial information; model must say what it found and what's missing
MULTI_CHUNK — answer spans multiple sections

Maximum 875 points per model at 35 questions.

First Problem: The Ceiling Effect

First run, 20 questions on Attention Is All You Need (the Transformer paper):

Model	Score
`qwen2.5:7b`	481/500 — 96.2%
`llama3.1:8b`	475/500 — 95.0%
`phi4:14b`	474/500 — 94.8%
`gemma2:9b`	473/500 — 94.6%
`llama3.2:3b`	466/500 — 93.2%
`mistral:7b`	463/500 — 92.6%

IN_CONTEXT category: every single model averaged 25/25. Perfect score.

I added 15 harder questions and rewrote the chunking.

The Rewrite That Changed Everything

The original code used TokenTextSplitter with default settings. I changed it to 200-token chunks with 100-token overlap between adjacent chunks:

TokenTextSplitter splitter = TokenTextSplitter.builder()
    .withChunkSize(200)
    .withKeepSeparator(true)
    .build();

List<Document> chunks = splitter.apply(documents);
List<Document> overlapped = overlapAppender.addOverlap(chunks, 100);

The idea: information lost at chunk boundaries (a sentence split across two chunks is fully represented in neither) would be preserved by overlapping.
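The overlap step can be sketched in a few lines. This is an illustration in Python rather than the Java helper used above, whose implementation isn't shown; it assumes whitespace-separated words as a stand-in for real tokens.

```python
def add_overlap(chunks, overlap):
    """Prefix each chunk with the last `overlap` tokens of the previous chunk."""
    out = []
    prev_tokens = []
    for chunk in chunks:
        tokens = chunk.split()  # whitespace split stands in for a real tokenizer
        # Guard overlap == 0: a [-0:] slice would return the whole list.
        tail = prev_tokens[-overlap:] if overlap > 0 else []
        out.append(" ".join(tail + tokens))
        prev_tokens = tokens
    return out
```

A sentence cut at a boundary now appears whole in at least one chunk, as long as the part before the cut fits in the overlap. The cost is more tokens to embed and more near-duplicate chunks competing in retrieval.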

New results on 35 questions, same document:

Model	Score
`phi4:14b`	839/875 — 95.9%
`qwen2.5:7b`	822/875 — 93.9%
`gemma2:9b`	818/875 — 93.5%
`llama3.1:8b`	815/875 — 93.1%
`mistral:7b`	780/875 — 89.1%
`llama3.2:3b`	771/875 — 88.1%

The ranking changed. phi4:14b, which was 3rd before, now leads. The spread grew from 3.6 to 7.8 percentage points.
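The spread figures follow directly from the two tables. A small check, using only the scores listed above:

```python
def pct(score, max_score):
    """Score as a percentage, rounded to one decimal like the tables."""
    return round(100 * score / max_score, 1)

def spread(scores, max_score):
    """Gap in percentage points between the best and worst model."""
    pcts = [pct(s, max_score) for s in scores.values()]
    return round(max(pcts) - min(pcts), 1)

run_20q = {"qwen2.5:7b": 481, "llama3.1:8b": 475, "phi4:14b": 474,
           "gemma2:9b": 473, "llama3.2:3b": 466, "mistral:7b": 463}
run_35q = {"phi4:14b": 839, "qwen2.5:7b": 822, "gemma2:9b": 818,
           "llama3.1:8b": 815, "mistral:7b": 780, "llama3.2:3b": 771}

print(spread(run_20q, 500), spread(run_35q, 875))  # 3.6 7.8
```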

Here's the Problem

I changed two things at the same time: the chunking strategy and the question difficulty. I can't isolate which change drove the ranking shift.

And I can prove the chunking changed what models actually saw.

Question q01: "How many attention heads does the base Transformer use?" — categorized as IN_CONTEXT because the answer (h = 8) is in the paper.

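Original chunking: retrieved a chunk containing h = 8. Model answered correctly.

New chunking: none of the retrieved chunks contained h = 8, so the model reported that the answer wasn't in the context it was given.

Judge score: 25/25. The model isn't lying: it answered correctly given what it received.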

But the system failed the user. That question is answerable. The document has the answer. The retrieval missed it.

The Second Document Made It Worse

I then ran the benchmark on a second document, a 70-page NIST regulatory standard. Same question structure, same judge, same chunking.

Model	Transformer paper	NIST	Drop
`phi4:14b`	95.9%	90.9%	−5.0 pp
`mistral:7b`	89.1%	88.6%	−0.5 pp
`qwen2.5:7b`	93.9%	87.8%	−6.1 pp
`gemma2:9b`	93.5%	83.4%	−10.1 pp
`llama3.1:8b`	93.1%	83.2%	−9.9 pp
`llama3.2:3b`	88.1%	79.3%	−8.8 pp

mistral:7b went from 5th to 2nd. gemma2:9b dropped 10 percentage points and posted the worst category score in the entire dataset (17.1/25 average in PARTIAL_CONTEXT on NIST).

Now I have two explanations and no way to distinguish them:

Explanation A: These are real model differences. Some models handle technical regulatory text better than dense academic prose. mistral is more stable across document types; gemma2 is more brittle.

Explanation B: Chunking performance is entirely document-dependent, and the empirical data proves there is no single "best" strategy for everything.

What I'd Actually Need to Know Which Model to Pick

Multiple chunking strategies per document type, held constant while varying models
Retrieval quality metrics separate from answer quality (MRR, Recall@5 — did the right chunk even make it into the top 5?)
Multiple judge models, not just one (my judge could have systematic biases I can't detect)
Real user questions from actual sessions, not questions I wrote after reading the document myself
Multiple runs per model to account for non-determinism

Without these, the ranking I have is a ranking of "this specific pipeline configuration" not "these models."

The Honest Takeaway

I didn't build a production RAG app. I built an understanding of how much is hidden under "just do RAG."

How I Run My Content Tooling on a Local Model for $0

Hugo Kuznicki — Sun, 28 Jun 2026 04:58:53 +0000

A few months ago I added up what I was spending on AI APIs just to draft social posts. It wasn't a lot — a few dollars here, a few there — but it was a recurring cost for something I do every single day. And every time I wanted to experiment, regenerate, or tweak a prompt, a little meter ticked in the back of my head telling me to stop wasting tokens.

So I moved the whole thing local. No API keys, no per-token billing, nothing leaving my machine. Here's exactly how, including the parts that aren't as clean as the pitch.

Why local at all?

Three reasons, in order of how much they actually mattered to me:

Cost goes to zero. Not "cheaper" — zero. Once the model is on your disk, generating a thousand drafts costs the same as generating one.
Iteration becomes free, which changes your behavior. This is the part nobody tells you. When each generation is metered, you ration attempts. When it's free, you regenerate aggressively — and the output gets better because you stop being precious about it.
Privacy by default. My prompts, drafts, and half-baked ideas never touch a third-party server. For content I haven't published yet, that's a real comfort.

The setup: Ollama in five minutes

Ollama is the easiest way to run an LLM locally. Install it, pull a model, and you've got an HTTP server on localhost that speaks a simple API.

# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull an instruct-tuned model
ollama pull llama3.1:8b

# It's now serving on http://localhost:11434

That's the entire infrastructure. No account, no key, no dashboard. The model runs as a local service and you talk to it over HTTP like any other API — except this one is on your machine and free.

The pipeline

My content workflow is deliberately boring: one topic in, a batch of platform-specific posts out. The whole thing is a thin layer around three ideas — a per-platform prompt template, a call to the local model, and a tiny bit of cleanup.

Here's the core call. Ollama exposes a /api/generate endpoint:

import requests

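def generate(prompt, model="llama3.1:8b"):
    # stream=False returns one JSON object instead of a stream of chunks
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # the first request after a model load can be slow
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return resp.json()["response"].strip()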

No SDK, no auth header, no OPENAI_API_KEY in your environment. It's just a POST to localhost.

The interesting part is the templating. Each platform gets its own prompt with its own constraints baked in:

TEMPLATES = {
    "twitter": (
        "Write 3 punchy tweet hooks about: {topic}\n"
        "Rules: under 280 chars, no hashtags, no emoji spam, "
        "lead with the most surprising angle."
    ),
    "linkedin": (
        "Write a short LinkedIn post about: {topic}\n"
        "Rules: 1 strong opening line, 3 short paragraphs, "
        "a question at the end. Plain language, no buzzwords."
    ),
    "thread": (
        "Outline a 5-tweet thread about: {topic}\n"
        "Each tweet on its own line, numbered, each able to stand alone."
    ),
}

def run(topic, platforms):
    out = {}
    for p in platforms:
        prompt = TEMPLATES[p].format(topic=topic)
        out[p] = generate(prompt)
    return out

Call run("local LLMs for content", ["twitter", "linkedin", "thread"]) and you get a dict of drafts back, generated entirely on your own hardware, for nothing.

The real product wraps this with a UI, a platform picker, and output cleanup — but the engine is genuinely this small. That's the point. Most of the value isn't in the model; it's in the templates that constrain the model into something usable.
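As an illustration of the "output cleanup" part, here is one way it might look for the "Write 3 hooks" style of prompt. This is a sketch under assumptions, not the code from the real tool: it assumes the model returns a numbered list, and `split_options` is a hypothetical helper name.

```python
import re

def split_options(text, max_chars=280):
    """Pull numbered options out of a model reply and drop ones over the limit."""
    options = []
    for line in text.splitlines():
        # Keep only lines that start with numbering such as "1." or "2)".
        match = re.match(r"^\s*\d+[.)]\s*(.+)$", line)
        if not match:
            continue
        cleaned = match.group(1).strip().strip('"')
        if cleaned and len(cleaned) <= max_chars:
            options.append(cleaned)
    return options
```

Filtering on the numbering also discards preambles like "Here are three hooks:", which smaller models tend to add even when told not to.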

The thing that actually makes it good: tight prompts

Smaller local models are less forgiving than a frontier API. A vague prompt to GPT-class hosted models still produces something passable. A vague prompt to an 8B local model produces mush. So the work shifts from "pay for a smarter model" to "write a sharper prompt."

Concretely, what moved quality the most:

Bake the constraints into the template, not the topic. Character limits, tone, structure — put them in the reusable template so every generation inherits them.
Ask for multiple options. "Write 3 hooks" beats "write a hook" — you pick the best and the model explores more of the space.
Keep a Modelfile for a custom system prompt if you find yourself repeating instructions:

FROM llama3.1:8b
SYSTEM """You are a concise copywriter. No clichés, no 'in today's
fast-paced world', no emoji unless asked. Plain, specific language."""

ollama create copywriter -f Modelfile

Now copywriter carries that voice everywhere and your per-call prompts get shorter.

The honest tradeoffs

I'm not going to pretend local is strictly better. It isn't.

Long-form coherence is weaker. For short-form (hooks, captions, threads) local models are great. For a 2,000-word essay that needs to hold an argument, a frontier API still wins. Know which job you're doing.
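Cold-start latency is real. The first request after the model unloads is slow. Keep it warm if you generate in bursts: leave ollama run open in the background, send a periodic keepalive ping, or raise the keep_alive value on your requests so the model stays loaded longer.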
You own the ops. No hosted API means no one else patches, scales, or babysits it. For a personal tool that's fine; for a product serving others it's a real consideration.
Hardware matters. An 8B model is comfortable on a modern laptop. Bigger models want more RAM/VRAM. Match the model to your machine instead of reaching for the biggest one.
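For the cold-start point, a minimal keep-warm sketch. It relies on two documented behaviours of Ollama's /api/generate endpoint: a request without a prompt loads the model, and keep_alive controls how long it stays in memory afterwards. The "30m" value and the function names are my own assumptions; tune them to your burst pattern.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def warmup_payload(model="llama3.1:8b", keep_alive="30m"):
    """Body for a load-only request: no prompt, just model and keep_alive."""
    return {"model": model, "keep_alive": keep_alive}

def warm(model="llama3.1:8b", keep_alive="30m", timeout=120):
    """Ask the local server to load the model and keep it resident."""
    data = json.dumps(warmup_payload(model, keep_alive)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status == 200
```

Calling warm() once before a batch means the first real generation doesn't pay the load cost.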

The trade I'm making — slightly less polish in exchange for $0 cost, full privacy, and unlimited iteration — is overwhelmingly worth it for high-frequency, templated work. That's most of what content generation actually is.

Wrapping up

The headline isn't "local models are magic." It's that for the specific job of churning out daily, templated content, the economics and the workflow both flip in local's favor — and the setup is genuinely a five-minute Ollama install plus a few prompt templates.

I packaged my own version of this into a small tool called Content Studio (idea → batch of posts, runs fully local, $0 to run) if you'd rather not wire it up yourself — it's on Gumroad and the open-source pieces live on my GitHub. And if you want the longer build-in-public breakdowns, I write them up in my newsletter.

But honestly — even if you build your own from the snippets above, do it. Watching your API bill hit $0 while your output goes up is a weirdly satisfying way to start a week.

Local LLMs in 2026: Which Runtime to Run and the Hardware You Need

Nishil Bhave — Sat, 27 Jun 2026 23:13:03 +0000

A few weekends ago I ran a 30-billion-parameter model on a laptop with no internet connection, and it answered my coding questions at reading speed. No API key. No per-token meter ticking. That setup would have been a research-lab flex two years ago. In 2026 it's a default install.

The tooling caught up fast. Ollama, the project most people start with, passed 174,000 GitHub stars and 16,700 forks by mid-2026 (GitHub, 2026), and the llama.cpp engine underneath much of this stack sits north of 73,000 stars of its own. But here's the honest part most "run AI locally" posts skip: a local LLM is still a niche. Menlo Ventures found open-source models hold just 11% of enterprise LLM usage in 2025, down from 19% the year before (Menlo Ventures, 2025). Most production traffic still hits a hosted API.

So who should actually run one, and with what? I've put real hours into Ollama, LM Studio, llama.cpp, and vLLM across a Mac and a mid-range GPU box. This is the working map: the four runtimes that matter, a decision box that tells you which to pick, and the hardware reality check, with the model-versus-model fights pushed out to dedicated guides so this one stays a map and not a maze.

Key Takeaways

Ollama leads on mindshare (174K+ GitHub stars, 2026), but it's a wrapper around llama.cpp, the engine doing the actual work (GitHub, 2026).

The "which runtime" question is really a concurrency question. For one user, Ollama, LM Studio, and llama.cpp are roughly tied; the moment you serve many users at once, vLLM pulls ahead by a wide margin.

At 64 concurrent users, vLLM generated about 44x more tokens per second than llama.cpp in Red Hat's benchmark, while llama.cpp's first token took over three minutes (Red Hat Developer, 2026).

Hardware is the real gate: a 70B model at Q4_K_M quantization wants roughly 40GB of memory, more than a single 24GB GPU can hold, so a pair of 24GB GPUs or a 64GB-plus Mac is the practical entry point for the big models.

Privacy and cost are the two honest reasons to go local. 44% of enterprises name data privacy as their top barrier to LLM adoption (Kong, 2025), and local inference has a marginal cost of zero per request.

What Is a Local LLM, and Why Run One in 2026?

A local LLM is a language model that runs entirely on your own machine, with no request leaving your hardware. That matters because privacy is the number one blocker to AI adoption: 44% of enterprises cite data privacy and security as their top barrier to using LLMs (Kong, 2025). When the model lives on your laptop, the prompt never travels.

The other reason is money. A hosted API charges per token forever. A local model charges you once, in hardware, and then runs at zero marginal cost per request. For a developer hammering a model all day, that math flips quickly. Privacy-focused builds keep sensitive code, contracts, or health data on-device, which is exactly why the "private llm" search trend keeps climbing.

There's a third reason that's quieter but real: control. You pick the exact model, the exact quantization, and the exact version. Nothing gets deprecated out from under you. Some people also run local models specifically to step outside hosted guardrails, a sub-audience covered in the guide to the best uncensored and roleplay local LLMs.

Now the anti-hype counterweight. Local does not mean free of tradeoffs. You give up frontier quality, you babysit your own hardware, and you eat the setup cost. Independent 2026 benchmarks put local inference on consumer hardware at roughly 70 to 85% of frontier-model quality on common tasks (Pooya Golchian, 2026). For a lot of work that's plenty. For the hardest reasoning, it isn't. Knowing which bucket your task lands in is the whole game.

What I actually saw: On an M-series Mac and a 12GB RTX 3060 box, the 7B and 8B models felt instant and genuinely useful for autocomplete, summarizing, and quick refactors. The 70B-class models technically loaded, but only on the Mac with enough unified memory, and they crawled. The gap between "runs" and "runs well" is almost entirely a hardware story, which is the section most guides bury.

The Four Local LLM Runtimes Worth Knowing

There are dozens of local-LLM tools, but four cover almost every real use case: Ollama, LM Studio, llama.cpp, and vLLM. Two of them (Ollama and LM Studio), like most desktop apps, are wrappers or GUIs sitting on top of llama.cpp, which crossed 73,000 GitHub stars as the de facto engine for consumer inference (GitHub, 2026). vLLM is the outlier, built for serving at scale.

Here's the honest one-line verdict on each, with the deep setups linked out so this stays a map:

Runtime	What it is	Best for	Interface
Ollama	The easy button. One command pulls and runs a model.	Getting started, scripting, local dev	CLI + API
LM Studio	A polished desktop GUI over the same engine.	Browsing, downloading, and chatting with zero terminal	GUI
llama.cpp	The C/C++ engine everything else is built on.	Max control, custom quantization, embedding in your own app	CLI / library
vLLM	A production inference server with continuous batching.	Serving many users, building a product, throughput	Server / API

Ollama is where most people should start, and the full walkthrough lives in the complete Ollama guide covering setup, models, the web UI, and troubleshooting. If you'd rather click than type, the LM Studio guide on downloading models and how LM Studio compares to Ollama is the better entry point. The Ollama-versus-LM-Studio choice is mostly taste: same engine, different front door.

According to GitHub's own counts, Ollama passed 174,000 stars and 16,700 forks by mid-2026 (GitHub, 2026), making it the most-starred local-LLM runtime by a wide margin. But star counts measure attention, not throughput. The engine underneath, llama.cpp, is what actually turns model weights into tokens, and choosing between the four runtimes is really about how many people you need to serve at once.

The reframe most comparisons miss: "Which runtime is best?" is the wrong question. They mostly run the same models at the same quality. The real question is "how many requests at once?" That single variable, concurrency, is what separates the easy desktop tools from vLLM, and it's the axis the next chart is built on.

Ollama vs llama.cpp vs vLLM: Which Runtime Is Fastest?

It depends entirely on load, and that caveat is the answer. For a single user, Ollama, LM Studio, and llama.cpp are roughly tied, often within a few tokens per second of each other. For many concurrent users, vLLM is in a different league: at 64 simultaneous users it generated about 44 times more tokens per second than llama.cpp in Red Hat's tests (Red Hat Developer, 2026).

Why the gap? Architecture. Tools like Ollama and llama.cpp process requests largely one at a time, which is perfect for a single developer at a keyboard. vLLM uses continuous batching and PagedAttention to interleave many requests across the GPU, so its throughput climbs as load climbs. The flip side: under heavy concurrency, llama.cpp's first token can take more than three minutes because requests queue (Red Hat Developer, 2026). One benchmark clocked vLLM at a peak of 793 tokens per second against Ollama's 41 under the same load, a roughly 19x gap (tech-insider, 2026).

Source: Red Hat Developer and independent vLLM vs Ollama benchmarks, 2026

The practical takeaway is simple. Are you one person at a keyboard? Ollama or LM Studio, and the throughput numbers above barely matter. Are you putting a model behind an app for real users? That's a vLLM job. The cross-runtime comparisons (llama.cpp vs Ollama, vLLM vs Ollama) live here on purpose, while the tool-specific deep dives stay in their own guides so the two don't overlap.

For one user, the runtime you pick changes your tokens per second by single digits. For a hundred users, it changes them by an order of magnitude. vLLM's continuous batching is the reason a production deployment serving concurrent traffic should not be running on the same tool a solo developer uses for autocomplete (Red Hat Developer, 2026).

Which Local LLM Tool Should You Use?

Pick based on one thing first: who's calling the model. A solo developer wants the easiest path (Ollama or LM Studio); a team shipping a product wants throughput (vLLM); a tinkerer who needs custom quantization wants the raw engine (llama.cpp). Everything else is a detail. Here's the decision box I actually use.

The which-tool-to-pick decision box

If you...	Run	Why
Want a model running in two minutes	Ollama	One command pulls and serves a model, with a built-in API
Prefer clicking to typing	LM Studio	A real GUI to browse, download, and chat, no terminal
Need custom quantization or to embed inference in your own binary	llama.cpp	The engine itself, minimal dependencies, total control
Are serving many users or building a product	vLLM	Continuous batching scales throughput with concurrency
Are on an Apple Silicon Mac and want max speed	Ollama or LM Studio	Both ride Metal/MLX acceleration under the hood
Want to wire a local model into your editor or agents	Ollama	Its OpenAI-compatible API drops into most tools

A point worth stressing: these aren't exclusive. My own setup runs Ollama for day-to-day CLI work and keeps LM Studio around for visually browsing new models before I commit. They share the same model files and the same engine, so switching costs almost nothing. If you want a local model powering an editor like Cursor or driving an agent, Ollama's OpenAI-compatible endpoint is the path of least resistance, and you can connect it to external tools through the Model Context Protocol, which standardizes how AI clients talk to tools and data.
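To show what "OpenAI-compatible" means in practice, here is a minimal sketch that calls Ollama's /v1/chat/completions route with nothing but the standard library. The model name is an assumption (use whatever you have pulled), and the helper names are mine.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible API root

def build_chat_request(prompt, model="llama3.1:8b"):
    """Build a chat-completions request in the OpenAI wire format."""
    body = {"model": model,
            "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"})

def chat(prompt, model="llama3.1:8b", timeout=120):
    """Send the request to the local server and return the reply text."""
    req = build_chat_request(prompt, model)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Because the request and response shapes match the hosted API, most editors and agent frameworks only need their base URL pointed at localhost to switch over.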

One boundary to keep straight: this is about runtimes, not agents. If you're comparing coding assistants (Cursor, Claude Code, Copilot) rather than the engines that run models, that's a different decision covered in the comparison of AI coding agents across five categories. Runtimes run models. Agents wrap workflows around them.

What Hardware Do You Need to Run a Local LLM?

Memory is the gate, not raw compute. The rule of thumb: at 4-bit quantization a model needs a little over half a gigabyte per billion parameters for its weights, plus overhead for the KV cache and runtime. A 7B model at Q4_K_M wants about 5 to 6GB; a 70B model at the same quantization wants roughly 40GB once you account for the KV cache and runtime overhead (SitePoint, 2026). That number decides everything else.
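A back-of-the-envelope version of this estimate, as a sketch. Two inputs are my assumptions rather than figures from the sources above: about 4.85 bits per weight as an approximate average for Q4_K_M, and a flat 1.5GB overhead as a placeholder (real KV-cache size grows with context length).

```python
def estimate_memory_gb(params_billion, bits_per_weight=4.85, overhead_gb=1.5):
    """Approximate memory for weights plus a flat overhead, in GB."""
    # 1e9 parameters * bits per weight / 8 bits per byte = gigabytes
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

def fits(params_billion, memory_gb, **kwargs):
    """True if the estimate fits in the given VRAM or unified memory."""
    return estimate_memory_gb(params_billion, **kwargs) <= memory_gb

for size in (7, 13, 32, 70):
    print(f"{size}B: ~{estimate_memory_gb(size)} GB")
```

With these assumptions a 7B model lands between 5 and 6GB and a 70B model in the low 40s, which is in line with the figures quoted above and shows why a 70B model does not fit on a single 24GB card.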

Quantization is the lever that makes local LLMs practical at all. It shrinks the model's weights from 16-bit floats down to 4-bit or 5-bit integers, cutting memory to roughly a quarter. The community settled on Q4_K_M as the sweet spot: the quality hit is tiny for everyday use, a perplexity delta of only about +0.05, though coding and multi-step reasoning can drop 5 to 15% versus full precision (Will It Run AI, 2026). In practice, a well-quantized model is almost always worth it to fit a bigger, smarter model into the same memory.

Source: SitePoint, llmhardware.io, and Will It Run AI quantization guides, 2026

So what should you buy? On the PC side, a 16GB GPU is now the realistic minimum for serious work, and a 24GB card (an RTX 3090 or 4090) is the practical sweet spot for consumer hardware. It will not hold a 70B model at Q4_K_M on its own, since that needs roughly 40GB (SitePoint, 2026), so 70B means adding a second card or offloading part of the model to system RAM at a large cost in speed. Below 24GB, you're living in 7B-to-13B territory, which is genuinely fine for autocomplete, summarizing, and most coding help. The best GPU for a local LLM is, bluntly, whichever one has the most VRAM you can afford.

A 70B model at Q4_K_M needs roughly 40GB of memory once you include the KV cache, which is more than a single 24GB consumer GPU can hold and is why a 64GB-plus unified-memory Mac is the realistic alternative (SitePoint, 2026). Match your model's memory footprint to your hardware first, and pick the model second. For which models actually fit and perform, the guide to the best open-source LLMs does the model-by-model breakdown.

Can You Run a Local LLM on a Mac?

Yes, and Apple Silicon is quietly one of the best local-LLM platforms you can buy, thanks to unified memory. On an M-series Mac, the CPU, GPU, and Neural Engine share one high-bandwidth memory pool, so the GPU reads model weights without copying them across a PCIe bus. The M4 Max moves data at about 546 GB/s, and because token generation is largely limited by memory bandwidth, that is a big part of why it generates tokens so quickly (SitePoint, 2026).

The catch is the same as everywhere: memory. A 70B model at Q4 is around 43GB, which technically fits a 64GB Mac, but macOS memory pressure spikes and the system starts swapping to SSD, which tanks your tokens per second. For a stable 70B workflow on a Mac in 2026, 128GB of unified memory is the realistic requirement (SitePoint, 2026). For 7B-to-32B models, a 32GB or 48GB Mac is comfortable.

One Mac-specific tip from my own testing: Apple's MLX framework, which both Ollama and LM Studio can use under the hood, runs noticeably faster than generic llama.cpp builds on the same hardware because it's written directly for Metal and unified memory (SitePoint, 2026). If you're on Apple Silicon, prefer an MLX-aware build, and you'll get free speed.

On Apple Silicon, unified memory means the usable model size is gated by your total RAM, not a separate VRAM number, so a 128GB Mac Studio can hold models that would need multiple datacenter GPUs on a PC (SitePoint, 2026). That's the single biggest reason Macs punch above their weight for local inference. The "mac llm" search trend exists for a reason: for many developers, the laptop they already own is the best local-LLM box in the house.

When You Should Not Run an LLM Locally

Be honest about this part, because the local-AI hype skips it. You should not run locally when you need frontier-level reasoning, when you need to serve real production traffic without owning a GPU fleet, or when the engineering time to maintain it costs more than the API bill. Open-source models sit at just 11% of enterprise LLM usage for a reason (Menlo Ventures, 2025): hosted frontier models still win on raw capability.

Source: Menlo Ventures, State of Generative AI in the Enterprise, 2025

The cleanest mental model is a hybrid one. Run small, frequent, privacy-sensitive work locally, and route the hard or high-stakes requests to a hosted frontier model. If you're picking between those frontier options, the Claude Opus vs GPT-5 comparison covers the top hosted pair. And when local stops scaling and you need to fan out across multiple providers cleanly, an AI gateway handles routing, fallback, and the cross-cutting concerns you'd otherwise hand-roll.

Local LLMs win on privacy and cost; hosted models win on peak capability and zero-ops scaling. The honest 2026 answer for most teams is not "local versus cloud" but "local for the 80% that's routine, cloud for the 20% that's hard." Treat it as a routing decision, not a religion.
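The routing idea fits in a few lines. This is a deliberately naive sketch: the `Task` fields and the rule order (privacy beats capability) are my assumptions about one reasonable policy, not a prescription.

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    sensitive: bool = False       # data must not leave the machine
    needs_frontier: bool = False  # hard reasoning or high-stakes output

def route(task):
    """Return "local" or "hosted" for a task."""
    if task.sensitive:
        return "local"   # privacy wins even if quality suffers
    if task.needs_frontier:
        return "hosted"  # the hard minority goes to a frontier model
    return "local"       # routine work stays on your own hardware
```

In a real system the two flags would come from a classifier or from explicit per-endpoint configuration. The point is only that the choice is made per request rather than once for the whole stack.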

Which Models Should You Run Locally?

Start with the model that fits your memory, then optimize for your task. A 7B-to-8B model handles autocomplete and summarizing on almost any modern machine; a 70B model is worth the hardware only if you need its reasoning. The open-source field moves monthly, with strong releases from the Llama, Qwen, DeepSeek, Gemma, and Mistral families all runnable through the runtimes above.

This pillar deliberately doesn't run the model-versus-model fights, because those are full guides on their own. Here's where to go:

For coding specifically: the ranked guide to the best LLMs for coding covers which models actually write good code, local and hosted.
For a general open-source pick: the best open-source LLMs breakdown ranks the current field by use case.
For the DeepSeek question: the DeepSeek R1 vs V3 comparison settles which version to run.
For uncensored or roleplay use: the best uncensored and roleplay local LLMs covers the models built without heavy guardrails.
For app ideas: the directory of awesome LLM apps catalogs what people build on top of these models.

The Hugging Face ecosystem now hosts roughly 135,000 GGUF-format models built specifically for local inference, up from a few hundred three years ago (Pooya Golchian, 2026), so the constraint in 2026 is almost never finding a model. It's matching the right one to your hardware and your task. Pick the runtime first, confirm your memory budget, then choose the biggest model that fits comfortably.

Frequently Asked Questions

Is Ollama or LM Studio better for running a local LLM?

They run the same models at the same quality, so it comes down to interface. Ollama is a command-line tool with a built-in API, ideal for scripting and dev work. LM Studio is a GUI for people who'd rather click than type. Ollama leads on adoption with 174,000+ GitHub stars in 2026 (GitHub, 2026).

What hardware do I need to run a local LLM?

Memory is the gate. A 7B model at 4-bit quantization needs about 5 to 6GB, while a 70B model needs roughly 40GB (SitePoint, 2026). A 16GB GPU is the realistic minimum for serious work; the largest models need more than one 24GB card or a 64GB-plus unified-memory Mac.

Is a local LLM as good as ChatGPT or Claude?

Not at the frontier, but closer than you'd think. Independent 2026 benchmarks put local inference at roughly 70 to 85% of frontier-model quality on common tasks (Pooya Golchian, 2026). For autocomplete, summarizing, and routine coding that's plenty; for the hardest reasoning, hosted models still lead.

Why run an LLM locally instead of using an API?

Privacy and cost. 44% of enterprises name data privacy as their top barrier to LLM adoption, which a local model removes entirely since no request leaves your machine (Kong, 2025). Local inference also has zero marginal cost per request, which adds up fast for heavy daily use.

Which runtime is fastest for serving many users?

vLLM, by a wide margin. Its continuous batching scales throughput with concurrency, generating about 44 times more tokens per second than llama.cpp at 64 concurrent users (Red Hat Developer, 2026). For a single user, though, Ollama and llama.cpp are roughly tied with it.

The Bottom Line on Running LLMs Locally

Running a local LLM in 2026 is no longer a research project; it's a two-minute install with Ollama and a hardware decision. The runtime you pick matters less than people think for solo use, and a lot more once you're serving real traffic. Get the order right: pick the runtime for your concurrency, size your hardware to the model, then choose the biggest model that fits.

Solo developer? Ollama or LM Studio, a 16GB-plus GPU or a 32GB-plus Mac, and a 7B-to-32B model.
Shipping a product? vLLM, datacenter GPUs, and a real serving setup.
Privacy-driven? Anything local beats a hosted API the moment your data can't leave the building.

If you're ready to actually install one, the next step is the full Ollama setup and model guide, the fastest path from zero to a model running on your own machine. Then come back and match a model to the hardware you've got.