augustine Egbuna

Posted on • Originally published at fivenineslab.com

Training Small LLMs to Edit Code Instead of Generating It

You've hit the wall with 2B-parameter models trying to write functions from scratch. The output is syntactically broken, logically confused, or full of hallucinated APIs. But what if you stopped asking these models to be creative and instead treated them as intelligent diff generators?

I've run this exact experiment with Qwen2.5-Coder-1.5B and Phi-3-mini on an RTX 3060 Ti. The insight is simple: small models fail at generation but succeed at transformation. Give them a working reference implementation from GitHub and ask them to modify it for your specific use case. The model operates in the space of edits, not invention.

Why Small Models Fail at Code Generation

A 2B model has seen enormous amounts of code during pretraining, but it lacks the parameter capacity to reliably reproduce complex patterns. When you prompt "write a Redis connection pool in Python with retry logic", the model must:

  1. Recall the Redis client API surface
  2. Remember exception hierarchies
  3. Generate retry backoff logic
  4. Handle connection lifecycle edge cases
  5. Produce syntactically valid, idiomatic Python

That's too many constraints for 2 billion parameters to satisfy simultaneously. You get code that looks plausible but fails on `import redis` or forgets to close connections.

But transformation is different. If you retrieve an existing Redis pool implementation and ask the model to "add exponential backoff to the retry logic", you've anchored it. The API calls are already there. The structure exists. The model only needs to insert a specific pattern it's seen hundreds of times.
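To make the "insert a pattern it's seen hundreds of times" point concrete, here is the kind of backoff snippet such an edit splices into an existing retry loop. This is an illustrative sketch of mine, not code from the experiment; the `connect` callable and parameter names are hypothetical:

```python
import time

def get_connection(connect, retries=3, base_delay=0.01):
    """Call `connect` with exponential backoff between failed attempts.

    `connect` is any zero-argument callable that raises ConnectionError
    on failure -- a stand-in for grabbing a client from a pool.
    """
    for attempt in range(retries):
        try:
            return connect()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries, propagate the error
            # Sleep base_delay * 2^attempt: 10ms, 20ms, 40ms, ...
            time.sleep(base_delay * (2 ** attempt))
```

The surrounding structure (the loop, the `try`/`except`, the final re-raise) already exists in the retrieved reference; the model only has to add the `time.sleep` line with the doubling delay.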

The Retrieval + Edit Architecture

Here's the pipeline I tested with Phi-3-mini (3.8B) on an RTX 3060 Ti:

````python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import uuid

# Index GitHub code (one-time setup)
embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
qdrant = QdrantClient(path="./code_index")
qdrant.recreate_collection(
    collection_name="code_snippets",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def index_github_snippets(repo_files):
    """Embed and store code snippets with metadata"""
    for file_path, content in repo_files:
        chunks = split_into_functions(content)  # Parse AST, extract functions
        embeddings = embedder.encode([c['code'] for c in chunks])
        qdrant.upsert(
            collection_name="code_snippets",
            points=[{
                "id": str(uuid.uuid4()),  # per-file indices would collide across files
                "vector": emb.tolist(),
                "payload": {"code": chunk['code'], "file": file_path}
            } for emb, chunk in zip(embeddings, chunks)]
        )

# Load the model once at startup, not on every request
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Retrieval + edit at inference
def edit_code_for_task(user_query):
    query_emb = embedder.encode(user_query)
    results = qdrant.search(
        collection_name="code_snippets",
        query_vector=query_emb.tolist(),
        limit=3
    )

    reference_code = results[0].payload['code']  # closest match anchors the edit

    prompt = f"""Edit this code to: {user_query}

Reference implementation:
```python
{reference_code}
```

Modified version:"""

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=512,
                             do_sample=True, temperature=0.2)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
````
The key is the prompt structure. You're not asking "write code to X". You're asking "here's code that does Y, modify it to do X". This constrains the solution space dramatically.

Performance Numbers on Low-End Hardware

On an RTX 3060 Ti (8GB VRAM), here's what I measured:

Phi-3-mini-4k-instruct (3.8B params, FP16)

  • Inference time: 2.1s for 256 tokens
  • VRAM usage: 7.2GB with batch size 1
  • Success rate (code runs without errors): 73% on my test set of 50 tasks

Qwen2.5-Coder-1.5B-Instruct (FP16)

  • Inference time: 1.3s for 256 tokens
  • VRAM usage: 3.1GB
  • Success rate: 61%

Compare this to the same models generating from scratch (no reference code):

  • Phi-3: 41% success rate
  • Qwen2.5-Coder: 29% success rate

The gap is huge. Editing existing code nearly doubles the reliability.

What This Actually Looks Like in Production

I deployed this as a VSCode extension prototype. The workflow:

  1. User highlights code and types a natural language edit request
  2. Extension embeds the request + existing code context
  3. Searches local Qdrant index (seeded with 50k Python functions from popular repos)
  4. Retrieves top-3 similar implementations
  5. Sends reference code + edit instruction to local Phi-3 instance via llama.cpp server
  6. Returns diff overlay in editor
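Step 5 talks to llama.cpp's HTTP server over its `/completion` endpoint. A minimal sketch of the client side, reusing the prompt template from earlier (the helper names and the localhost URL are my assumptions, not part of the extension code):

```python
import json
import urllib.request

def build_edit_request(reference_code, instruction, n_predict=512):
    """Assemble the JSON body llama.cpp's /completion endpoint expects."""
    prompt = (
        f"Edit this code to: {instruction}\n\n"
        f"Reference implementation:\n{reference_code}\n\n"
        "Modified version:"
    )
    return {"prompt": prompt, "n_predict": n_predict, "temperature": 0.2}

def request_edit(body, url="http://localhost:8080/completion"):
    """POST the body to the local server; the reply's 'content' field
    holds the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

Keeping the prompt template identical between the Transformers prototype and the llama.cpp deployment means the quantized model sees exactly the inputs the FP16 benchmarks were measured on.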

The llama.cpp server runs with:

```bash
./server \
  -m phi-3-mini-4k-instruct.Q4_K_M.gguf \
  -c 4096 \
  -ngl 35 \
  --host 0.0.0.0 \
  --port 8080
```

Quantization (Q4_K_M) drops VRAM to 2.4GB. Inference is 3.2s on an RTX 2060. That's fast enough for an interactive editing assistant.

The Limitations You'll Hit

This isn't a magic solution. The model still hallucinates when:

  • Retrieved code is too different from the target task (embedder failure)
  • Edit requires understanding distant context (small context windows)
  • Task involves proprietary APIs not in the training data

I found the sweet spot is refactoring, adding error handling, changing API versions, and adapting patterns. The model is bad at architectural decisions or designing new abstractions.

Also, code retrieval quality matters more than model size. A better embedding model (say, Salesforce/SFR-Embedding-Mistral) improves success rate by 8-12 percentage points. The model can only edit what you feed it.

Should You Build This?

If you're running on constrained hardware and need a local coding assistant, yes. The retrieval + edit pattern is the only way I've found to get reliable output from sub-4B models.

But if you have access to larger models (CodeLlama 13B, DeepSeek-Coder 6.7B), stick with those. They can generate reasonably well from scratch, and the added complexity of maintaining a code index isn't worth it.

The real use case is edge deployment: offline environments, privacy-sensitive codebases, or devices where you can't run 13B+ models. A 2B editor beats no assistant at all.

For infrastructure teams, this matters if you're building internal developer tools. You can ship a locally-running code assistant that doesn't leak proprietary code to external APIs. The cost is maintaining the GitHub index and embedding pipeline, which is straightforward with Qdrant + a scheduled indexing job.

I'm running this setup in production for an internal CLI tool generator. Developers describe what they want, the system retrieves similar CLI implementations from our repos, and Phi-3 generates the modified version. It's not AGI, but it's useful.


This post is an excerpt from Practical AI Infrastructure Engineering — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at https://activ8ted.gumroad.com/l/ssmfkx

