Manoranjan Rajguru

Local LLMs for Daily Coding in 2026: The Complete Engineer's Guide to Replacing Claude & GPT with a Private AI Dev Stack

[Figure: Local LLM developer workstation]


Table of Contents

  1. The Great Local LLM Migration
  2. Why Developers Are Going Local in 2026
  3. The Model Landscape: Choosing Your Local LLM
  4. Hardware: What You Actually Need
  5. The Inference Stack: llama.cpp + Vulkan
  6. KV Cache Mastery: The preserve_thinking Breakthrough
  7. Agentic Coding Harnesses: Pi & OpenCode
  8. Performance Reality Check: Local vs. Frontier Models
  9. Security Sandboxing: Running AI Agents Safely
  10. Ollama + MLX: The Apple Silicon Fast Lane
  11. The Future: Where Local AI Coding Goes Next
  12. Conclusion

1. The Great Local LLM Migration

On June 16, 2026, a single Hacker News thread cracked open something the AI industry had been quietly building toward for over a year. The post title was almost mundane: "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?" Within hours, it had climbed to 759 upvotes and 363 comments — one of the most engaged AI threads of the year.

The responses weren't theoretical. Developers weren't asking whether local LLMs could replace frontier models for coding. They were sharing working setups: exact model names, hardware configurations, inference engine flags, KV cache tuning tricks, and honest performance comparisons. This wasn't a hobbyist experiment; it was a field report from engineers who had already made the switch and were debugging the rough edges.

The timing wasn't coincidental. In mid-2026, the local LLM landscape crossed a threshold. Several factors converged: dramatically better open-weight models, consumer hardware with massive unified memory, mature inference tooling, geopolitical shocks to cloud AI availability, and real privacy concerns about sending proprietary code to external APIs. Together they pushed local LLMs for coding from a weekend hack into a viable daily workflow.

This guide is for the engineer who wants to understand that convergence technically, set up a production-grade local AI coding stack, and know exactly where the trade-offs are. We'll cover model selection, hardware math, inference engine configuration, the critical KV cache breakthrough that makes agentic coding viable locally, and the security architecture you need to run AI agents without creating a blast radius.


2. Why Developers Are Going Local in 2026

Before we talk about tooling, it's worth being precise about why this migration is happening, because the reasons shape how you should design your local stack.

2.1 Privacy and Code Sovereignty

Every line of code you send to a cloud-hosted LLM API is, at minimum, processed on hardware you don't control, by a company operating under a legal jurisdiction that may not align with your enterprise's compliance requirements. For engineers at financial institutions, defense contractors, healthcare companies, or any organization handling trade secrets, this creates genuine legal exposure. Local inference removes that exposure at the source: your code never leaves the machine.

2.2 API Cost at Scale

Heavy API usage with frontier models can run $100–500+ per developer per month; check current pricing against your own usage. A team of 20 engineers doing intensive AI-assisted development can spend tens of thousands of dollars annually. After the upfront hardware investment, the marginal cost of local inference is essentially electricity. For developers at the high end of that range, the break-even on a Mac Studio M4 Ultra with 192GB unified memory can be measured in months, not years.
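The break-even claim is simple arithmetic, and it is sensitive to how much you actually spend. A minimal sketch, assuming a hypothetical hardware price (not a quoted figure) and ignoring electricity:

```python
import math


def break_even_months(hardware_cost: float, monthly_api_cost: float) -> int:
    """Months until cumulative API spend reaches a one-time hardware purchase."""
    if monthly_api_cost <= 0:
        raise ValueError("monthly_api_cost must be positive")
    return math.ceil(hardware_cost / monthly_api_cost)


# Hypothetical workstation price; substitute a real quote before deciding.
HARDWARE_COST = 6000.0

for monthly in (100.0, 300.0, 500.0):
    months = break_even_months(HARDWARE_COST, monthly)
    print(f"${monthly:.0f}/month API spend -> break-even after {months} months")
```

Under these assumed numbers, the low end of the spend range takes years to pay back, while the high end pays back in about a year.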

2.3 Geopolitical and Supply-Side Shocks

2026 has delivered several reminders that access to frontier AI is not guaranteed. Anthropic's most powerful model — internally codenamed Fable 5 — was blocked from release by the US White House over safety concerns, leaving developers on a waitlist for a tool they had built workflows around. Separately, GitHub Copilot's infrastructure hit an AI capacity crunch severe enough that Microsoft began routing traffic to AWS. When your development workflow depends on a third-party API you don't control, it can be interrupted for reasons that have nothing to do with your usage patterns.

2.4 Latency and Offline Operation

Local models have zero network round-trip latency. On modern unified memory hardware, inference latency for a 35B MoE model with 3B active parameters is often lower than a network call to a cloud API under load. And critically, local models work on planes, in secure facilities, and in network-constrained environments where cloud APIs are simply unavailable.

2.5 Vendor Lock-In and Model Deprecation

Cloud providers deprecate models with limited notice. Every forced move to a new model version means re-testing and re-tuning your prompting strategy. With a local model, you control the exact checkpoint. Your models.ini doesn't change unless you change it.


3. The Model Landscape: Choosing Your Local LLM

The question is no longer "can local models code?" — it's "which local model is right for which coding task?" Here's the current landscape as of mid-2026.

3.1 Qwen3.6 35B-A3B: The Daily Driver

The Qwen3.6 35B-A3B is the clear community consensus for daily AI-assisted coding. It's a Mixture-of-Experts (MoE) architecture with 35 billion total parameters but only about 3 billion active parameters per token. This is the critical distinction: a router sends each token through a small subset of experts, so the model draws on the knowledge of a 35B model while its per-token FLOPs and memory traffic are closer to those of a 3B model. The full 35B weights must still fit in memory, which is why the hardware minimums later in this guide track total parameters, not active ones.
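To make sparse expert selection concrete, here is a toy top-k router in plain Python. It is an illustrative sketch, not Qwen's actual routing code: the eight scalar "experts", the random gate scores, and k=2 are all assumptions chosen for readability.

```python
import math
import random


def top_k_route(gate_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x: float, experts, gate_logits: list[float], k: int = 2) -> float:
    """Weighted sum over the selected experts only; the others are never evaluated."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Toy setup: 8 scalar functions standing in for expert feed-forward blocks.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
rng = random.Random(0)
gate_logits = [rng.uniform(-1.0, 1.0) for _ in experts]

selected = top_k_route(gate_logits, k=2)
print("experts evaluated:", [i for i, _ in selected], "of", len(experts))
print("output:", moe_forward(3.0, experts, gate_logits))
```

All eight experts have to exist in memory, but only two run per input, which mirrors the total-versus-active parameter split.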

The practical result on a Mac Studio with 128GB unified memory: fast, responsive inference that makes it viable for interactive agentic coding workflows. Community benchmarks suggest roughly a 5x productivity speedup over coding without AI assistance — compared to ~15x with Claude Opus, but at zero ongoing cost.

Qwen3.6 also introduces Hybrid Thinking Modes: a thinking mode where the model reasons step-by-step before answering (ideal for complex architecture decisions or hard algorithms), and a non-thinking mode for quick completions and boilerplate. This gives developers explicit token-budget control that frontier APIs don't always expose.

3.2 Qwen3.5 122B-A10B: The Power Mode

For genuinely hard problems — complex refactors across large codebases, novel algorithm design, debugging deeply nested async code — the Qwen3.5 122B-A10B is worth the performance cost. With 10B active parameters, it's approximately 4x slower than the 35B model on the same hardware, but it handles ambiguous or underspecified tasks significantly better and loses track of long-range context far less often.

3.3 Gemma 4 31B: The Generalist

Google's Gemma 4 31B is the community favorite for non-coding tasks: documentation writing, commit message generation, code explanation, architecture Q&A, and translation. It complements Qwen3.6 well in a multi-model local setup where you route by task type.

3.4 GPT-OSS 120B: Speed When You Need It

For high-throughput, latency-sensitive use cases where raw intelligence matters less than fast token generation, GPT-OSS 120B (OpenAI's open-weight release) provides excellent tokens-per-second on well-provisioned hardware — useful for batch code review or documentation generation pipelines.

Model Comparison Table

| Model | Active Params | Context | Best Use Case | Hardware Minimum |
| --- | --- | --- | --- | --- |
| Qwen3.6 35B-A3B | ~3B (MoE) | 128K | Daily coding (sweet spot) | 36GB unified memory |
| Qwen3.5 122B-A10B | ~10B (MoE) | 128K | Complex refactors | 128GB unified memory |
| Gemma 4 31B | 31B (dense) | 128K | Docs, explanation | 64GB unified memory |
| GPT-OSS 120B | ~5.1B (MoE) | 128K | High-speed generation | 128GB+ |
| Qwen3.6 27B | 27B (dense) | 128K | Mid-tier alternative | 48GB unified memory |

4. Hardware: What You Actually Need

The hardware requirements for running local LLMs for coding are more accessible than most engineers expect, but the memory math is non-negotiable. Understanding why memory matters so much helps you make smarter hardware choices.

[Figure: Hardware comparison for local LLM inference]

4.1 The Unified Memory Advantage

The single most important hardware characteristic for local LLM inference is memory bandwidth, not raw compute. LLMs at inference time are memory-bandwidth-bound: the bottleneck is how fast you can shuttle model weights between memory and compute units, not how fast you can execute the matrix multiplications themselves.

This is why Apple Silicon M-series chips and AMD's Strix Halo architecture are so compelling for local inference. They use unified memory, where CPU and GPU share a single high-bandwidth memory pool. There's no PCIe bottleneck, no need to fit the model within discrete GPU VRAM, and no memory copies between host and device. A discrete GPU setup with two high-end cards has to manage model weight sharding across them — unified memory architectures just see one flat pool.

For comparison:

  • A discrete GPU rig (2× RTX 5090, 32GB VRAM each) gives you ~64GB total at roughly 1.8 TB/s per card, but the model has to be split across the two cards and fit within that 64GB
  • An Apple M4 Ultra Mac Studio with 192GB unified memory gives you ~800 GB/s across the entire pool
  • An AMD Strix Halo laptop with 128GB gives you ~256 GB/s
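Because decoding is bandwidth-bound, a rough ceiling on generation speed follows from dividing memory bandwidth by the bytes read per token. The sketch below assumes every active weight is read exactly once per token and ignores KV cache reads, attention compute, and scheduling overhead, so measured throughput will land well below these figures.

```python
def decode_ceiling_tokens_per_s(
    bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float
) -> float:
    """Theoretical upper bound on decode speed for a bandwidth-bound model."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures taken from the unified-memory entries in the list above.
systems = {"M4 Ultra Mac Studio": 800.0, "Strix Halo laptop": 256.0}

for name, bandwidth in systems.items():
    # ~3B active parameters at ~4.5 bits per weight (a Q4_K_M-style quant).
    ceiling = decode_ceiling_tokens_per_s(bandwidth, active_params_b=3, bits_per_weight=4.5)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

The same formula shows why a dense 31B model is so much slower than a 3B-active MoE on identical hardware: it reads roughly ten times as many bytes per token.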

4.2 Memory Requirements by Quantization

Model weights are quantized to reduce memory footprint. For Qwen3.6 35B, the common quantization options and their trade-offs:

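| Quantization | Bits/Weight (approx.) | Memory (35B) | Quality |
| --- | --- | --- | --- |
| F16 (half precision, unquantized) | 16 | ~70GB | Baseline |
| Q8_0 | 8.5 | ~37GB | Minimal loss |
| Q4_K_M | 4.5 | ~22GB | Low loss (recommended) |
| IQ4_XS | 4.25 | ~20GB | Low-medium loss |
| Q3_K_M | 3.5 | ~16GB | Medium loss |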

For daily coding, Q4_K_M is the standard recommendation: fits on a 36GB MacBook Pro while maintaining output quality close to full precision. If you have 64GB+, run Q8_0 for noticeably better code generation, especially on less common frameworks.
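The memory column follows from parameters × bits per weight ÷ 8. A small sketch of that estimate; the `reserve_gb` headroom for the OS, KV cache, and other apps is an assumed placeholder, and real GGUF files typically come out somewhat larger than the raw estimate because some tensors are stored at higher precision.

```python
def weight_memory_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: parameters x bits / 8."""
    return total_params_b * bits_per_weight / 8


def fits(
    total_params_b: float,
    bits_per_weight: float,
    system_memory_gb: float,
    reserve_gb: float = 8.0,  # assumed headroom, not a measured figure
) -> bool:
    """True if the weights plus the reserve fit in system memory."""
    return weight_memory_gb(total_params_b, bits_per_weight) + reserve_gb <= system_memory_gb


for name, bits in [("F16", 16), ("Q4_K_M", 4.5), ("Q3_K_M", 3.5)]:
    gb = weight_memory_gb(35, bits)
    print(f"{name}: ~{gb:.1f} GB of weights, fits in 36 GB: {fits(35, bits, 36)}")
```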

4.3 Recommended Hardware Tiers

Tier 1 (MacBook Pro M4 Max, 36GB): The minimum viable local LLM development machine. Runs Qwen3.6 35B-A3B at Q4_K_M comfortably. Handles daily coding tasks; expect ~20–30 tokens/second.

Tier 2 (Mac Studio M4 Max/Ultra, 128–192GB): The sweet spot. Runs Qwen3.6 at Q8_0 or higher, can switch to Qwen3.5 122B for hard tasks, and runs multiple models concurrently. The top choice for serious local AI development.

Tier 3 (AMD Strix Halo laptop, 128GB): A compelling Windows/Linux portable alternative. Comparable memory to a mid-range Mac Studio, with Vulkan/ROCm flexibility. Some Qwen MoE optimizations still lag Apple's Metal/MLX path.


5. The Inference Stack: llama.cpp + Vulkan

A quantized model file is inert without an inference engine to run it. As of mid-2026, llama.cpp running on the Vulkan backend is the community-validated choice for non-Apple hardware.

5.1 Why Vulkan Over ROCm?

On AMD Strix Halo hardware running Qwen hybrid MoE models, community reports indicate that the Vulkan backend in llama.cpp matches or exceeds ROCm in tokens-per-second while being substantially more stable. The likely reason: ROCm's kernel fusion and memory management paths for sparse MoE layers haven't been optimized as aggressively as the Vulkan compute shaders in llama.cpp's GGML backend. The HN community has largely converged on Vulkan for non-CUDA, non-Apple setups.

5.2 Building llama.cpp with Vulkan

# Clone llama.cpp and stay on the default branch (master) for the latest Qwen3.x + MoE support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with Vulkan support
cmake -B build \
  -DGGML_VULKAN=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON

cmake --build build --config Release -j$(nproc)

5.3 Downloading a Quantized Model

# Install huggingface-cli
pip install huggingface_hub

# Download Qwen3.6 35B-A3B at Q4_K_M (~22GB) — good for 36GB+ systems
huggingface-cli download \
  Qwen/Qwen3.6-35B-A3B-GGUF \
  qwen3.6-35b-a3b-q4_k_m.gguf \
  --local-dir ./models/qwen3.6-35b

# Or Q8_0 (~37GB) for 64GB+ systems: better code quality
huggingface-cli download \
  Qwen/Qwen3.6-35B-A3B-GGUF \
  qwen3.6-35b-a3b-q8_0.gguf \
  --local-dir ./models/qwen3.6-35b

5.4 Launching the llama.cpp Server

The llama.cpp server exposes an OpenAI-compatible REST API, making it a drop-in backend for any tool that supports an OpenAI endpoint — OpenCode, VS Code extensions, custom Python scripts, and more.

# Launch llama.cpp server with Vulkan backend
#   --n-gpu-layers 99  offload all layers to GPU/unified memory
#   --ctx-size 32768   context window (start at 32K, max 128K)
#   --parallel 1       single inference stream for coding sessions
#   --flash-attn       FlashAttention, critical for long contexts
./build/bin/llama-server \
  --model ./models/qwen3.6-35b/qwen3.6-35b-a3b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --host 127.0.0.1 \
  --port 8080 \
  --parallel 1 \
  --batch-size 512 \
  --flash-attn \
  --verbose

Key flags in detail:

  • --n-gpu-layers 99: Offloads all transformer layers to GPU/unified memory. Lower this if you hit OOM.
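  • --flash-attn: Enables FlashAttention, which significantly reduces the memory required during long-context inference. Always enable this. Recent llama.cpp builds may expect an explicit value (on, off, or auto), so check llama-server --help for your build.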
  • --ctx-size 32768: Start here for stability. Qwen3.6 supports 128K but large contexts require proportionally more memory.
  • --parallel 1: One inference stream is optimal for interactive coding sessions. Increase only for batch workloads.
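To see why larger contexts cost proportionally more memory, it helps to estimate the KV cache size directly. The sketch below uses the standard formula for full-attention layers; the layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen3.6's published dimensions, and hybrid models with fewer full-attention layers will need less.

```python
def kv_cache_gb(
    ctx_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes for an F16 cache
) -> float:
    """KV cache size in GB: keys and values for every layer, head, and position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_value
    return total_bytes / 1e9


# Hypothetical model dimensions, chosen only to illustrate the scaling.
DIMS = dict(n_layers=48, n_kv_heads=4, head_dim=128)

for ctx in (32_768, 65_536, 131_072):
    print(f"ctx-size {ctx}: ~{kv_cache_gb(ctx, **DIMS):.2f} GB of KV cache")
```

The cache grows linearly with `--ctx-size`, so going from 32K to 128K quadruples this part of the memory budget.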

5.5 Verifying the Server

# Test the OpenAI-compatible endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "user", "content": "Write a Python function to binary search a sorted list"}
    ],
    "temperature": 0.1,
    "max_tokens": 512
  }'
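The same check can be scripted with Python's standard library, which is handy as a starting point for custom tooling. This is a minimal sketch that assumes the server is listening on the host and port used above; `build_chat_request` and `ask` are illustrative helper names, not part of llama.cpp.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # matches the --host/--port flags above


def build_chat_request(
    prompt: str, temperature: float = 0.1, max_tokens: int = 512
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#   print(ask("Write a Python function to binary search a sorted list"))
```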

6. KV Cache Mastery: The preserve_thinking Breakthrough

This section covers what may be the single most important technical optimization for making local LLMs viable for agentic coding — a breakthrough that only arrived with Qwen3.6.

[Figure: KV cache flow with preserve_thinking]

6.1 The Problem: Why Agentic Sessions Got Slow

In a typical agentic coding session, the conversation context accumulates a long sequence of interleaved reasoning traces and tool calls: the model thinks, calls a file-read tool, processes the result, thinks again, makes a code edit, checks the output, and so on. After 10–20 such turns, the context can hold tens of thousands of tokens.

Here's the critical problem with pre-Qwen3.6 local MoE models: reasoning/thinking tokens were not persisted across turns. When you sent a new message, the chat template stripped the reasoning traces from all prior turns before the context was re-encoded. The resulting token sequence no longer matched what the engine had cached: it diverged at the first stripped trace, so everything after that point had to be re-processed on every turn instead of being served from the KV cache.

In practice, a 20-turn agentic coding session on a large codebase could take 10–15 minutes of pure inference time, killing the interactive feel that makes agentic coding valuable.

6.2 The Solution: preserve_thinking

Qwen3.6 is the first widely-deployed local model trained to support preserving thinking traces across turns. When enabled, reasoning tokens from previous turns are kept in the context window rather than stripped. Two things change:

  1. The KV cache from prior turns remains valid and can be reused — the engine only processes new tokens, not the entire history.
  2. The model retains its prior reasoning, enabling coherent multi-turn problem solving without re-deriving the same conclusions on each turn.

The trade-off: preserved thinking uses more context window space. But the speedup in long agentic sessions is substantial — community members on the HN thread report 60–80% reduction in per-turn latency for long sessions after enabling this flag.
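The cache effect comes down to prefix matching: the engine can reuse cached state only for the longest prefix that the new prompt shares with what it already processed. The toy model below uses strings as stand-ins for token runs and is an illustration of the idea, not llama.cpp's implementation.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def strip_thinking(tokens: list[str]) -> list[str]:
    """Mimic a chat template that drops reasoning traces from earlier turns."""
    return [t for t in tokens if not t.startswith("<think>")]


# What the engine has cached after the previous turn (stand-in "tokens").
cached = [
    "user:fix the parser bug",
    "<think>inspect parser.py first</think>",
    "assistant:call read_file",
    "tool:file contents",
]
follow_up = ["tool:tests still failing"]

prompts = {
    "stripped": strip_thinking(cached) + follow_up,
    "preserved": cached + follow_up,
}
for label, prompt in prompts.items():
    reused = common_prefix_len(cached, prompt)
    print(f"{label}: reuse {reused} cached entries, re-process {len(prompt) - reused}")
```

With stripping, the match ends at the first removed trace; with preservation, only the newly appended entry needs processing.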

6.3 Configuring preserve_thinking

# models.ini — llama.cpp model configuration
[qwen3.6-35b-a3b]
model-path = ./models/qwen3.6-35b/qwen3.6-35b-a3b-q4_k_m.gguf
n-gpu-layers = 99
ctx-size = 65536
flash-attn = true

# THE critical setting for agentic sessions:
# Preserves reasoning traces across turns, enabling KV cache reuse
# Prevents per-turn context re-processing in long agentic workflows
chat-template-kwargs = {"preserve_thinking": true}

6.4 KV Cache Snapshots and Hybrid Attention

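Many recent models are hybrids. Some layers use full attention, whose KV cache can be truncated back to any earlier position and reused. Others use sliding-window or recurrent-style layers, whose state cannot be rolled back to an arbitrary position and therefore depends on saved state snapshots. When the engine needs to re-process context, it falls back to the nearest saved snapshot. If no suitable snapshot exists, it restarts from the beginning, which is the worst case.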
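The fallback rule can be pictured as a lookup for the latest snapshot at or before the reusable prefix. This is an illustrative model of the behavior described above, not llama.cpp's actual snapshot logic; the snapshot positions are made up.

```python
import bisect


def resume_position(snapshots: list[int], reusable_prefix: int) -> int:
    """Token position to resume from.

    snapshots must be sorted ascending. Returns the latest snapshot at or
    before reusable_prefix, or 0 (full re-processing) if none qualifies.
    """
    i = bisect.bisect_right(snapshots, reusable_prefix)
    return snapshots[i - 1] if i else 0


snapshots = [4096, 8192, 16384]  # hypothetical saved-state positions

for prefix in (12000, 16384, 3000):
    start = resume_position(snapshots, prefix)
    print(f"reusable prefix {prefix}: resume at {start}, re-process {prefix - start} tokens")
```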

The practical implication: keep your llama.cpp build current. Several snapshot management bugs for MoE hybrid models were fixed on the llama.cpp master branch in Q1–Q2 2026. Running outdated builds is the leading cause of unexpected context re-processing and the associated latency spikes.

# Always pull latest for MoE model support
git pull origin master
cmake --build build --config Release -j$(nproc)

# Verify your build version
./build/bin/llama-server --version

7. Agentic Coding Harnesses: Pi & OpenCode

A raw llama.cpp server is an inference endpoint — not a coding assistant. You need an agentic coding harness that manages conversation history, tool calling (file read/write, shell execution, git operations), and the control loop that makes an LLM into a software engineering agent.

7.1 The Pi Coding Harness

Pi is a lightweight, privacy-first coding agent. The recommended configuration runs Pi in a container, connecting to llama.cpp in a separate container, creating a sandboxed development environment where your code never touches the network.

# docker-compose.yml — Production local AI coding stack
version: "3.9"

services:
  # llama.cpp inference server — isolated inference network only
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    # Linux hosts: pass the GPU device nodes through so Vulkan can see the GPU
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - ./models:/models:ro
      - ./models.ini:/app/models.ini:ro
    command: >
      --model /models/qwen3.6-35b/qwen3.6-35b-a3b-q4_k_m.gguf
      --n-gpu-layers 99
      --ctx-size 65536
      --host 0.0.0.0
      --port 8080
      --flash-attn
    networks:
      - ai-net

  # Pi coding agent — sandboxed to single project directory
  pi-agent:
    image: pi-coding:latest
    environment:
      - OPENAI_API_BASE=http://llama-server:8080/v1
      - OPENAI_API_KEY=none
      - OPENAI_MODEL=local
    volumes:
      # Bind ONLY the specific project being worked on
      - ./my-project:/workspace:rw
      # Pi config and session history — persisted across restarts
      - ~/.pi:/root/.pi:rw
    working_dir: /workspace
    networks:
      - ai-net

networks:
  ai-net:
    driver: bridge
    internal: true  # No external internet — fully air-gapped

7.2 OpenCode: The Browser-Based Workflow

OpenCode ships with a built-in web UI, terminal, file browser, git diff viewer, and a mobile-friendly interface — making it possible to start a change on your workstation, review the diff on your phone, and merge the PR from anywhere. For homelab and self-hosted development environments, it's particularly compelling.

# Install OpenCode
npm install -g opencode-ai

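# Configure to use local llama.cpp backend
# (config keys can differ between OpenCode versions; check the docs for your release)
mkdir -p ~/.config/opencode
cat > ~/.config/opencode/config.json << 'EOF'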
{
  "provider": "openai-compatible",
  "baseURL": "http://127.0.0.1:8080/v1",
  "apiKey": "none",
  "model": "local",
  "maxTokens": 16384
}
EOF

# Launch OpenCode server with web UI on port 3000
opencode serve --port 3000 --workspace /path/to/your/project

7.3 The Effective Workflow Pattern

The most effective local AI coding workflow that has emerged from community practice follows this sequence:

  1. Plan precisely. Describe the feature or bug fix to the agent with explicit detail. Unlike frontier models, local models don't gracefully infer unstated architectural preferences — vague prompts get vague results.
  2. Use thinking mode for hard problems. Toggle the Qwen3.6 thinking mode on for complex algorithms or architecture decisions. Use non-thinking mode for boilerplate and simple completions.
  3. Review every diff before commit. Never let the agent push directly to your main branch. Always read the diff.
  4. Gate with CI. Point the agent at failing CI logs with "here's the error, fix it." This tight feedback loop is where local models excel.

8. Performance Reality Check: Local vs. Frontier Models

Let's be honest about where local models stand today. The framing that best captures it: running Qwen3.6 35B locally is like working with a well-read junior engineer who needs precise direction, while Claude Opus is like working with a senior architect who thinks alongside you.

[Figure: Local vs. frontier model performance comparison]

8.1 Where Local Models Win

  • Well-specified tasks: Unit tests for existing functions, boilerplate CRUD endpoints, SQL query generation, regex, config file generation — local models perform close to frontier quality here.
  • Known, popular frameworks: Django, Rails, Spring, React, FastAPI — anything heavily represented in open training data.
  • Short-context speed: For under 4K tokens of context, Qwen3.6 35B-A3B often responds faster than a network call to Claude under API load.
  • Privacy-sensitive code: For IP you can't send to external APIs, local is the only option.

8.2 Where Frontier Models Still Lead

  • Ambiguous tasks: Frontier models make better architectural assumptions when the prompt leaves details open. Local models take the path of least resistance.
  • Niche or obscure frameworks: Wagtail, Elixir Phoenix, Zig — limited training data means local models struggle more here.
  • Long-horizon coherence: In 50+ turn sessions with 100K+ tokens, frontier models maintain consistency better. Local models more often lose track of earlier decisions.
  • Loop recovery: Local models enter tool-calling loops more frequently and need more explicit system prompt guidance to recover.

8.3 Productivity Numbers

Based on community reports (measure against your own workflow before relying on these numbers):

| Scenario | Frontier (Claude Opus) | Local (Qwen3.6 35B) |
| --- | --- | --- |
| Productivity multiplier | ~15x | ~5x |
| Monthly cost (heavy use) | $200–500+ | ~$0 after hardware |
| Privacy | Code sent to external API | Fully offline |
| Availability | Depends on API uptime | Always available |
| Setup complexity | Low | Medium–High |

9. Security Sandboxing: Running AI Agents Safely

An AI coding agent with file system and shell access is a significant security surface. These principles should govern every local AI coding setup, regardless of how much you trust the underlying model.

9.1 Least-Privilege File System Mounts

Never bind-mount your home directory or your top-level projects directory into an agent container. Bind only the specific project directory the agent needs for the current session.

# ❌ Bad: Full home directory access
docker run -v ~/:/workspace pi-coding:latest

# ✅ Good: Bind only the specific project, read-write.
# ~/.pi holds Pi config only, not credentials.
docker run \
  -v ~/projects/my-api:/workspace:rw \
  -v ~/.pi:/root/.pi:rw \
  pi-coding:latest

9.2 Network Segmentation

The llama.cpp inference server has zero need for internet access — it's pure compute. The coding agent similarly should only reach the inference server and your local Git server, not the open internet. Use Docker's internal: true flag to enforce this at the network layer.

If you genuinely need documentation lookup during a session, use an egress proxy with domain allowlisting rather than open internet access from the agent container.
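The core of domain allowlisting is a strict host check. The sketch below shows the policy decision only; in practice it would be enforced inside the egress proxy, and the domains listed are hypothetical examples.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the documentation hosts you actually need.
ALLOWED_DOMAINS = frozenset({"docs.python.org", "developer.mozilla.org"})


def is_allowed(url: str, allowed: frozenset = ALLOWED_DOMAINS) -> bool:
    """Allow a URL only if its host is an allowlisted domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in allowed)


for url in (
    "https://docs.python.org/3/library/asyncio.html",
    "https://docs.python.org.evil.example/",
    "https://evil.example/docs.python.org",
):
    print(f"{url} -> {'allow' if is_allowed(url) else 'deny'}")
```

Matching on the parsed hostname rather than on a substring of the URL is what keeps lookalike hosts and path tricks from slipping through.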

9.3 Credential Isolation

Never include SSH keys, API tokens, .env files, or cloud credentials in any directory bind-mounted into the agent container. For Git push access, provision a dedicated SSH key pair for the agent with write permissions scoped to specific repositories — not your personal key.
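A pre-flight scan of the directory you are about to mount can catch obvious mistakes. This is a minimal sketch: the filename patterns are a small assumed starting set, not a complete secret scanner, and `find_sensitive_files` is an illustrative helper name.

```python
from pathlib import Path

# Assumed starting patterns; extend for your own environment.
SENSITIVE_NAMES = {"id_rsa", "id_ed25519", ".npmrc", ".netrc", "credentials"}
SENSITIVE_SUFFIXES = {".pem", ".key"}


def find_sensitive_files(root: str) -> list[Path]:
    """Return files under root whose names look like credentials or secrets."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if (
            path.name in SENSITIVE_NAMES
            or path.name.startswith(".env")
            or path.suffix in SENSITIVE_SUFFIXES
        ):
            hits.append(path)
    return sorted(hits)


# Usage: refuse to start the agent container if anything is found.
#   hits = find_sensitive_files("/path/to/your/project")
#   if hits:
#       raise SystemExit(f"Refusing to mount; found: {[str(h) for h in hits]}")
```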

9.4 PR-Based Review as a Safety Gate

The golden rule from the community: size your trust to your review confidence. The agent pushes to feature branches. You merge PRs. GitOps handles deployments downstream. This ensures every AI-generated change passes a human review before it reaches any system with real blast radius.


10. Ollama + MLX: The Apple Silicon Fast Lane

If you're on an Apple Silicon Mac, Ollama with the MLX backend is the recommended path as of mid-2026. In March 2026, Ollama shipped MLX support in preview, delivering Apple Silicon-native inference that outperforms the Metal backend for most models and makes the barrier to entry near-zero.

10.1 Installation and Model Setup

# Install Ollama (includes MLX support on Apple Silicon)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen3.6 35B-A3B; the tag pins the Q4_K_M quantization
ollama pull qwen3.6:35b-a3b-q4_k_m

# Pull Gemma 4 31B for docs and general tasks
ollama pull gemma4:31b

# Verify the models were downloaded
ollama list

10.2 Using the OpenAI-Compatible API

# Use Ollama's OpenAI-compatible endpoint directly
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6:35b-a3b-q4_k_m",
    "messages": [
      {
        "role": "system",
        "content": "You are an expert software engineer. Be precise and concise."
      },
      {
        "role": "user",
        "content": "Refactor this to async/await: def fetch_data(url): return requests.get(url).json()"
      }
    ],
    "temperature": 0.1
  }'

10.3 Multi-Model Routing in Python

One underused pattern: routing requests between models based on task complexity signals. Qwen3.6 for the daily driver, escalate to Qwen3.5 122B for genuinely hard tasks.

import httpx

OLLAMA_BASE = "http://localhost:11434/v1"

# Signals that indicate a task warrants the heavier model
COMPLEXITY_SIGNALS = [
    "refactor", "architecture", "design pattern",
    "optimize", "performance", "entire codebase",
    "thread-safe", "distributed", "concurrency"
]

def chat(prompt: str, use_heavy_model: bool = False) -> str:
    """Route to Qwen3.6 35B for daily tasks, Qwen3.5 122B for complex ones."""
    model = (
        "qwen3.5:122b-a10b-q4_k_m" if use_heavy_model
        else "qwen3.6:35b-a3b-q4_k_m"
    )
    response = httpx.post(
        f"{OLLAMA_BASE}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1,
            "max_tokens": 2048,
        },
        timeout=120.0,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


def auto_route(prompt: str) -> str:
    """Automatically select model based on complexity signals in the prompt."""
    is_complex = any(sig in prompt.lower() for sig in COMPLEXITY_SIGNALS)
    model_used = "122B (complex)" if is_complex else "35B (standard)"
    print(f"[router] Routing to {model_used}")
    return chat(prompt, use_heavy_model=is_complex)


# Example usage
if __name__ == "__main__":
    # Routes to 35B — simple task
    print(auto_route("Write a unit test for a binary search function"))

    # Routes to 122B — complexity signal detected
    print(auto_route("Refactor the entire authentication module to be thread-safe"))

11. The Future: Where Local AI Coding Goes Next

The trajectory of local LLMs for coding is clear, and the rate of progress suggests the capability gap with frontier models will continue to close.

Unified memory will keep scaling. Apple Silicon chips are on a roadmap toward 256GB and beyond in prosumer configurations. AMD's Strix Halo successors are similarly scaling. More memory means larger models at higher precision — the Qwen3.5 122B will become a viable daily driver as memory grows.

MoE efficiency will improve. The Qwen3.6 generation proved that MoE architectures with sparse activation can deliver disproportionate capability relative to active compute. The next generation of open-weight models will push this further — larger total parameter counts, smaller active parameter footprints, faster iteration.

Tool use and long-horizon agentic capability will catch up. The current gap between local models and frontier models is most acute in complex, multi-turn agentic tasks. As model training increasingly incorporates agentic synthetic data, tool use traces, and multi-turn reasoning, expect local model capability here to improve substantially in the next 12 months.

Inference will get hardware-specific. MLX on Apple Silicon, hardware-specific kernels for Strix Halo, and ZLUDA for alternative GPU architectures are making inference progressively faster on specific hardware. The generic Vulkan/CPU path will remain for portability, but hardware-specific paths will deliver significantly better tokens-per-second for developers who optimize their stack.

Privacy regulations will accelerate enterprise adoption. As AI data handling regulations tighten globally, enterprise procurement of local inference hardware will accelerate. Engineers who build deep fluency with local LLM stacks today are positioning themselves for a skill that enterprise teams will urgently need.


12. Conclusion

The Hacker News thread from June 16, with its 759 upvotes and 363 comments from engineers comparing notes on replacing Claude and GPT with local models for daily coding, isn't a sign that cloud AI is dead. It's a sign that running local LLMs for coding has matured enough to be a serious engineering choice, not a compromise.

The stack is now well-defined: Qwen3.6 35B-A3B as your daily driver, llama.cpp with Vulkan (or Ollama with MLX on Apple Silicon) as your inference engine, preserve_thinking enabled to keep your KV cache efficient across long agentic sessions, and Pi or OpenCode as your coding harness — all running in least-privilege containers with PR-based review before anything reaches production.

You won't match Claude Opus on ambiguous architectural tasks or obscure frameworks. But for well-specified work, community reports suggest roughly a 5x productivity multiplier at near-zero marginal cost, with your code never leaving your machine. And when the next API outage hits, the next model gets blocked, or the next pricing change lands in your inbox, you'll already be running.

Start today. Install Ollama, pull qwen3.6:35b-a3b-q4_k_m, and wire it up to OpenCode or your existing VS Code setup. You don't need a Mac Studio to get started — a 36GB MacBook Pro is enough to build intuition for the stack. Run a real task: write a test, refactor a function, generate a migration. See where the gaps are. The migration is already happening, and now you have the technical roadmap to do it right.


Have you replaced Claude or GPT with a local model for daily coding? Share your hardware setup, model choices, and hard-won lessons in the comments — the community knowledge base on this is moving fast, and your experience could be the breakthrough the next engineer needs.


Tags: llm, ai, machinelearning, productivity, devtools, privacy, qwen, localai, llama, coding
