Alan West

Why Local LLMs Keep Failing at Code Generation (and How to Fix It)

You finally got that 34B parameter model running on your beefy GPU. You feed it a prompt. It confidently writes a function that looks perfect — until you realize it's calling an API that literally doesn't exist. Sound familiar?

I spent the better part of three months trying to make local LLMs my primary coding assistant. I wanted the privacy, the zero-cost inference, the offline capability. What I got was a masterclass in debugging AI-generated hallucinations. But I also figured out what actually works, and more importantly, why local models struggle with code in ways that aren't immediately obvious.

Let's break this down.

The Root Cause: It's Not Just "Model Size"

The knee-jerk explanation is "local models are too small." That's part of it, but it misses the real problem. Code generation fails locally for three interconnected reasons:

1. Quantization destroys code precision. When you squish a 70B model down to 4-bit quantization so it fits in your 24GB of VRAM, you're losing fidelity in the exact places that matter for code. Natural language is forgiving — swap a synonym and meaning is preserved. Code isn't. A single wrong token means a TypeError or a function that doesn't exist.

2. Context window limits kill real-world usefulness. Most local setups give you 4K-8K context reliably. Some models advertise 32K or 128K, but actual performance degrades badly in the upper ranges when running quantized on consumer hardware. Real coding tasks — refactoring a module, understanding how a service connects to three others — need a lot of context.

3. Training data gaps compound everything. Smaller models have seen fewer code examples, fewer Stack Overflow answers, fewer GitHub repos. They're especially weak on newer frameworks, niche libraries, and language-specific idioms that larger training runs would catch.
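To put rough numbers on the quantization point, here's a back-of-the-envelope VRAM estimate (a sketch only; real GGUF files mix quant types per layer, and you still need headroom for the KV cache and context):

# Back-of-the-envelope VRAM estimate for quantized weights (approximation only --
# real GGUF quants mix bit widths, and the KV cache needs headroom on top).
def approx_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1024**3

print(approx_vram_gb(70, 4.0))  # ~32.6 GB: a 70B at ~Q4 overflows a 24 GB card, forcing CPU offload
print(approx_vram_gb(34, 4.5))  # ~17.8 GB: a 34B at ~Q4_K_M fits, with little room to spare
print(approx_vram_gb(15, 6.5))  # ~11.4 GB: a 15B at ~Q6_K fits comfortably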

Step 1: Pick the Right Model for Code (Not the Biggest One)

Not all models are equal for code tasks. A general-purpose 70B chat model will often perform worse at code than a specialized 7B-15B code model. Here's what to look for:

# My current local model selection criteria
priority_order:
  - Code-specialized training (not just general chat)
  - Native context length (not extended via RoPE hacks)
  - Quantization headroom (a 15B at Q6_K > a 70B at Q3_K_M)
  - Instruction-tuned for code completion AND chat

Models fine-tuned specifically on code datasets — things like CodeLlama variants, DeepSeek-Coder, or StarCoder-based models — punch way above their parameter count. A 7B code-specialized model will often outperform a general-purpose 13B model on function generation, bug fixing, and code explanation.

Check the model card for what it was trained on. If the training data section doesn't specifically mention code corpora, keep looking.
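If the model lives on the Hugging Face Hub, you can do a quick programmatic skim of the card (a sketch using the huggingface_hub library; the keyword list is purely illustrative and the repo name is just an example):

# Quick-and-dirty check for code-related training data in a model card.
# Assumes the model is on the Hugging Face Hub; keywords are illustrative.
from huggingface_hub import ModelCard

CODE_HINTS = ["code", "github", "stack overflow", "the stack", "starcoder"]

def looks_code_trained(repo_id: str) -> bool:
    card_text = ModelCard.load(repo_id).text.lower()
    return any(hint in card_text for hint in CODE_HINTS)

print(looks_code_trained("deepseek-ai/deepseek-coder-6.7b-instruct"))  # expected: True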

Step 2: Fix Your Quantization Strategy

This is where most people silently lose quality. The default advice of "just use Q4_K_M" is fine for chatting about philosophy. It's not fine when a single wrong token breaks your build.

# Instead of this (common default):
llama-server -m codellama-34b.Q4_K_M.gguf -c 4096

# Try a smaller model at higher quantization:
llama-server -m deepseek-coder-v2-lite.Q6_K.gguf -c 8192 \
  --n-gpu-layers 35  # offload as many layers to GPU as fit

The tradeoff math is simple:

  • Q6_K or Q8_0 on a smaller model = precise token prediction
  • Q3_K or Q4_K on a bigger model = more knowledge, fuzzier output

For code, precision wins. I'd rather have a model that correctly generates the 15 most common patterns than one that almost gets 50 patterns right.

Test this yourself. Take the same prompt, run it against a 34B-Q4 and a 15B-Q6 five times each. Count the outputs that run without modification. I'll bet the smaller, higher-quant model wins.
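If you want to automate that comparison, a minimal harness looks something like this (a sketch assuming a local llama-server exposing its OpenAI-compatible /v1/chat/completions endpoint on port 8080; it only checks that the output parses, so actually running the snippets, as in Step 5, is the stronger test):

# Repeatability check: same prompt, N runs, count outputs that at least parse.
# Assumes a local llama-server with the OpenAI-compatible API on port 8080.
import ast
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"
PROMPT = "Write a Python function that parses a CSV file into a list of dicts. Return only code."

def generate(prompt: str) -> str:
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }).encode()
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def parses(text: str) -> bool:
    # Strip the markdown fences models often wrap around code
    code = text.strip().removeprefix("```python").removeprefix("```").removesuffix("```")
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

runs = [generate(PROMPT) for _ in range(5)]
print(f"{sum(parses(r) for r in runs)}/5 outputs were at least valid Python")

Swap the model loaded by the server between runs and compare the counts.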

Step 3: Engineer Your Prompts Like You Mean It

Local models are way more sensitive to prompt quality than the big cloud APIs. A lazy prompt that works fine with a 400B+ parameter model will crash and burn locally.

What works:

# BAD prompt for local models:
prompt = "Write a function to parse CSV files"

# GOOD prompt for local models:
prompt = """Write a Python 3.11 function that:
- Takes a file path (str) as input
- Reads a CSV file using the csv module from stdlib
- Returns a list of dictionaries where keys are column headers
- Handles the case where the file doesn't exist (raise FileNotFoundError)
- Do NOT use pandas
- Include type hints

Function signature: def parse_csv(filepath: str) -> list[dict[str, str]]:
"""

The difference is night and day. Key principles:

  • Specify the language and version. Don't let the model guess.
  • Name the libraries (or explicitly exclude them). Local models love to import packages that don't exist or mix up APIs across libraries.
  • Provide the function signature. This constrains the output and reduces hallucination.
  • Be explicit about error handling. Otherwise you'll get either nothing or a seven-layer try/except lasagna.

Step 4: Use Fill-in-the-Middle, Not Chat

Here's a trick that dramatically improved my local code generation quality. Stop using chat mode for inline coding tasks. Most code-specialized models support Fill-in-the-Middle (FIM) — you give them a prefix and suffix, and they generate what goes between.

# FIM format (model-specific, check your model's docs for the exact tokens):
prefix = 'def calculate_tax(income: float, rate: float) -> float:\n    """Calculate tax with standard deduction."""\n'
suffix = "\n    return round(tax, 2)"

# The model fills in the middle — constrained by both sides
# This produces FAR more accurate code than open-ended chat generation

FIM works because it constrains the model's output on both ends. The model can't hallucinate a wildly different function signature or return type because the suffix already defines the boundary. For autocomplete-style coding — which is honestly 70% of what you want a coding assistant for — FIM with a local model is genuinely competitive.

Most editor integrations (Continue, llama.vim, Tabby) support FIM natively. Use them.
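If you'd rather script FIM calls directly, llama.cpp's server also exposes an /infill endpoint for FIM-capable models (a sketch; the endpoint and its input_prefix/input_suffix fields follow llama.cpp's server docs, so verify against your server version):

# FIM completion via llama.cpp's /infill endpoint (requires a FIM-capable model).
# Field names are taken from llama.cpp's server docs; check your version.
import json
import urllib.request

def fim_complete(prefix: str, suffix: str, url: str = "http://localhost:8080/infill") -> str:
    body = json.dumps({
        "input_prefix": prefix,
        "input_suffix": suffix,
        "n_predict": 128,
        "temperature": 0.1,
    }).encode()
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]

prefix = 'def calculate_tax(income: float, rate: float) -> float:\n    """Calculate tax with standard deduction."""\n'
suffix = "\n    return round(tax, 2)"
print(prefix + fim_complete(prefix, suffix) + suffix)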

Step 5: Set Up a Validation Pipeline

Here's the uncomfortable truth: even with all the above, local models will still generate broken code sometimes. The fix is to stop trusting and start verifying.

#!/bin/bash
# save as: validate_generated.sh
# Run generated code through basic checks before accepting

FILE=$1

# Syntax check
python3 -c "import ast; ast.parse(open('$FILE').read())" 2>&1
if [ $? -ne 0 ]; then
    echo "FAIL: Syntax error in generated code"
    exit 1
fi

# Type check with mypy (fast, catches hallucinated APIs)
python3 -m mypy "$FILE" --ignore-missing-imports 2>&1
if [ $? -ne 0 ]; then
    echo "WARN: Type errors detected"
fi

# Run any existing tests that touch the modified module
# (the --timeout flag requires the pytest-timeout plugin)
python3 -m pytest tests/ -x --timeout=10 2>&1

I run something like this automatically in my editor whenever I accept a generated code block. It catches the most common local LLM failure — confidently calling functions or methods that don't exist on an object. mypy is particularly good at catching these.
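If your editor doesn't have a convenient hook, a small gatekeeper function gets you most of the way (a sketch that mirrors the script above: accept a generated snippet only if it parses and passes mypy):

# Gatekeeper for generated snippets: write to a temp file, run the same checks
# as validate_generated.sh, and only accept the code if they pass.
import ast
import subprocess
import tempfile

def accept_generated(code: str) -> bool:
    try:
        ast.parse(code)  # syntax check
    except SyntaxError:
        return False
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        path = tmp.name
    # mypy catches most hallucinated attributes and misused APIs
    result = subprocess.run(
        ["python3", "-m", "mypy", path, "--ignore-missing-imports"],
        capture_output=True,
    )
    return result.returncode == 0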

When to Bail: Know the Limits

Even with all these fixes, local LLMs have hard limits you should respect:

  • Multi-file refactoring: Needs too much context. Don't even try.
  • Debugging complex runtime errors: The model needs to understand state, call stacks, and timing. It won't.
  • Anything requiring up-to-date API knowledge: If the library was updated after the model's training cutoff, you'll get plausible-looking code for an API version that no longer exists.
  • Generating tests for complex business logic: The model doesn't understand your domain. It'll write tests that pass but test nothing meaningful.

For these tasks, you're better off using local models for smaller subtasks — generate a single function, write a type definition, convert a data structure — and doing the architectural thinking yourself.

The Honest Summary

Local LLMs for coding aren't useless. They're just unforgiving. You can't use them the way you'd use a cloud model — fire off a vague prompt and get back working code. You need to pick specialized models, preserve quantization quality, write precise prompts, use FIM for inline completion, and validate everything.

Is it more work? Yeah. But you get offline capability, complete privacy, zero API costs, and — once you dial it in — a surprisingly capable coding assistant that runs on hardware you already own.

The trick is matching the tool to the task. Use local models for the 80% of coding work that's pattern-matching and boilerplate. Keep your judgment for the 20% that requires actual understanding.
