augustine Egbuna

Posted on • Originally published at fivenineslab.com

Training Small LLMs to Edit Code Instead of Generating It

You've hit the wall with 2B-parameter models trying to write functions from scratch. The output is syntactically broken, logically confused, or full of hallucinated APIs. But what if you stopped asking these models to be creative and instead treated them as intelligent diff generators?

I've run this exact experiment with Qwen2.5-Coder-1.5B and Phi-3-mini on an RTX 3060 Ti. The insight is simple: small models fail at generation but succeed at transformation. Give them a working reference implementation from GitHub and ask them to modify it for your specific use case. The model operates in the space of edits, not invention.

Why Small Models Fail at Code Generation

A 2B model has seen enormous amounts of code during pretraining, but it lacks the parameter capacity to reliably reproduce complex patterns. When you prompt "write a Redis connection pool in Python with retry logic", the model must:

  1. Recall the Redis client API surface
  2. Remember exception hierarchies
  3. Generate retry backoff logic
  4. Handle connection lifecycle edge cases
  5. Produce syntactically valid, idiomatic Python

That's too many constraints for 2 billion parameters to satisfy simultaneously. You get code that looks plausible but fails on `import redis` or forgets to close connections.

But transformation is different. If you retrieve an existing Redis pool implementation and ask the model to "add exponential backoff to the retry logic", you've anchored it. The API calls are already there. The structure exists. The model only needs to insert a specific pattern it's seen hundreds of times.
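To make the "insert a pattern it's seen hundreds of times" point concrete, here is the kind of backoff snippet such an edit splices into an existing retry loop. This is an illustrative sketch of mine, not code from the experiment; the `connect` callable and parameter names are hypothetical:

```python
import time

def get_connection(connect, retries=3, base_delay=0.01):
    """Call `connect` with exponential backoff between failed attempts.

    `connect` is any zero-argument callable that raises ConnectionError
    on failure -- a stand-in for grabbing a client from a pool.
    """
    for attempt in range(retries):
        try:
            return connect()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries, propagate the error
            # Sleep base_delay * 2^attempt: 10ms, 20ms, 40ms, ...
            time.sleep(base_delay * (2 ** attempt))
```

The surrounding structure (the loop, the `try`/`except`, the final re-raise) already exists in the retrieved reference; the model only has to add the `time.sleep` line with the doubling delay.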

The Retrieval + Edit Architecture

Here's the pipeline I tested with Phi-3-mini (3.8B) on an RTX 3060 Ti:

````python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import uuid

# Index GitHub code (one-time setup)
embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
qdrant = QdrantClient(path="./code_index")
qdrant.recreate_collection(
    collection_name="code_snippets",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def index_github_snippets(repo_files):
    """Embed and store code snippets with metadata"""
    for file_path, content in repo_files:
        chunks = split_into_functions(content)  # Parse AST, extract functions
        embeddings = embedder.encode([c['code'] for c in chunks])
        qdrant.upsert(
            collection_name="code_snippets",
            points=[{
                "id": str(uuid.uuid4()),  # per-file indices would collide across files
                "vector": emb.tolist(),
                "payload": {"code": chunk['code'], "file": file_path}
            } for emb, chunk in zip(embeddings, chunks)]
        )

# Load the model once at startup, not on every request
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Retrieval + edit at inference
def edit_code_for_task(user_query):
    query_emb = embedder.encode(user_query)
    results = qdrant.search(
        collection_name="code_snippets",
        query_vector=query_emb.tolist(),
        limit=3
    )

    reference_code = results[0].payload['code']  # closest match anchors the edit

    prompt = f"""Edit this code to: {user_query}

Reference implementation:
```python
{reference_code}
```

Modified version:"""

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=512,
                             do_sample=True, temperature=0.2)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
````
The key is the prompt structure. You're not asking "write code to X". You're asking "here's code that does Y, modify it to do X". This constrains the solution space dramatically.

Performance Numbers on Low-End Hardware

On an RTX 3060 Ti (8GB VRAM), here's what I measured:

Phi-3-mini-4k-instruct (3.8B params, FP16)

  • Inference time: 2.1s for 256 tokens
  • VRAM usage: 7.2GB with batch size 1
  • Success rate (code runs without errors): 73% on my test set of 50 tasks

Qwen2.5-Coder-1.5B-Instruct (FP16)

  • Inference time: 1.3s for 256 tokens
  • VRAM usage: 3.1GB
  • Success rate: 61%

Compare this to the same models generating from scratch (no reference code):

  • Phi-3: 41% success rate
  • Qwen2.5-Coder: 29% success rate

The gap is huge. Editing existing code nearly doubles the reliability.

What This Actually Looks Like in Production

I deployed this as a VSCode extension prototype. The workflow:

  1. User highlights code and types a natural language edit request
  2. Extension embeds the request + existing code context
  3. Searches local Qdrant index (seeded with 50k Python functions from popular repos)
  4. Retrieves top-3 similar implementations
  5. Sends reference code + edit instruction to local Phi-3 instance via llama.cpp server
  6. Returns diff overlay in editor
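Step 5 talks to llama.cpp's HTTP server over its `/completion` endpoint. A minimal sketch of the client side, reusing the prompt template from earlier (the helper names and the localhost URL are my assumptions, not part of the extension code):

```python
import json
import urllib.request

def build_edit_request(reference_code, instruction, n_predict=512):
    """Assemble the JSON body llama.cpp's /completion endpoint expects."""
    prompt = (
        f"Edit this code to: {instruction}\n\n"
        f"Reference implementation:\n{reference_code}\n\n"
        "Modified version:"
    )
    return {"prompt": prompt, "n_predict": n_predict, "temperature": 0.2}

def request_edit(body, url="http://localhost:8080/completion"):
    """POST the body to the local server; the reply's 'content' field
    holds the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

Keeping the prompt template identical between the Transformers prototype and the llama.cpp deployment means the quantized model sees exactly the inputs the FP16 benchmarks were measured on.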

The llama.cpp server runs with:

```bash
./server \
  -m phi-3-mini-4k-instruct.Q4_K_M.gguf \
  -c 4096 \
  -ngl 35 \
  --host 0.0.0.0 \
  --port 8080
```

Quantization (Q4_K_M) drops VRAM to 2.4GB. Inference is 3.2s on an RTX 2060. That's fast enough for an interactive editing assistant.

The Limitations You'll Hit

This isn't a magic solution. The model still hallucinates when:

  • Retrieved code is too different from the target task (embedder failure)
  • Edit requires understanding distant context (small context windows)
  • Task involves proprietary APIs not in the training data

I found the sweet spot is refactoring, adding error handling, changing API versions, and adapting patterns. The model is bad at architectural decisions or designing new abstractions.

Also, code retrieval quality matters more than model size. A better embedding model (say, Salesforce/SFR-Embedding-Mistral) improves success rate by 8-12 percentage points. The model can only edit what you feed it.

Should You Build This?

If you're running on constrained hardware and need a local coding assistant, yes. The retrieval + edit pattern is the only way I've found to get reliable output from sub-4B models.

But if you have access to larger models (CodeLlama 13B, DeepSeek-Coder 6.7B), stick with those. They can generate reasonably well from scratch, and the added complexity of maintaining a code index isn't worth it.

The real use case is edge deployment: offline environments, privacy-sensitive codebases, or devices where you can't run 13B+ models. A 2B editor beats no assistant at all.

For infrastructure teams, this matters if you're building internal developer tools. You can ship a locally-running code assistant that doesn't leak proprietary code to external APIs. The cost is maintaining the GitHub index and embedding pipeline, which is straightforward with Qdrant + a scheduled indexing job.

I'm running this setup in production for an internal CLI tool generator. Developers describe what they want, the system retrieves similar CLI implementations from our repos, and Phi-3 generates the modified version. It's not AGI, but it's useful.


This post is an excerpt from Practical AI Infrastructure Engineering — a production handbook covering Docker, GPU infrastructure, vector databases, and LLM APIs. Full book with 4 hands-on capstone projects available at https://activ8ted.gumroad.com/l/ssmfkx

