<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jared Pilcher</title>
    <description>The latest articles on DEV Community by Jared Pilcher (@jared_pilcher).</description>
    <link>https://dev.to/jared_pilcher</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4011598%2F7a05de77-5ee7-47b9-ba76-f8e9be43de32.jpg</url>
      <title>DEV Community: Jared Pilcher</title>
      <link>https://dev.to/jared_pilcher</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jared_pilcher"/>
    <language>en</language>
    <item>
      <title>I spent ten days forcing tiny local models to write real code. Here's what actually breaks.</title>
      <dc:creator>Jared Pilcher</dc:creator>
      <pubDate>Thu, 02 Jul 2026 04:16:00 +0000</pubDate>
      <link>https://dev.to/jared_pilcher/i-spent-ten-days-forcing-tiny-local-models-to-write-real-code-heres-what-actually-breaks-29k7</link>
      <guid>https://dev.to/jared_pilcher/i-spent-ten-days-forcing-tiny-local-models-to-write-real-code-heres-what-actually-breaks-29k7</guid>
      <description>&lt;p&gt;I had a thought a few weeks ago that wouldn't leave me alone - I depend on Claude Code every day. If it disappeared tomorrow, priced out, rate limited, whatever, I'd want a fallback I actually own. Not a cheaper subscription. Something that runs on my own hardware, forever, at zero marginal cost.&lt;/p&gt;

&lt;p&gt;So I started an experiment. A coding harness where every reasoning call goes to a tiny local model (Gemma 4 2B, served by llama.cpp on a Jetson Orin Nano), and the harness does everything it can to make up the difference. One hard rule - no cloud fallback, ever. If the small model can't do something, decompose the work or move it into deterministic code. Never escalate to a bigger model.&lt;/p&gt;

&lt;p&gt;The bet isn't mine alone. Projects like little-coder and NVIDIA's small-model research make the same wager - small models underperform agentic work because their harnesses are thin, not because the models are incapable. I wanted to find out exactly how true that is, with numbers I could trust. Ten days in, here's what I've learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The harness was throwing away right answers
&lt;/h2&gt;

&lt;p&gt;My biggest early win wasn't making the model smarter. It was noticing that about 60% of my failures were the model producing correct logic with broken indentation. The module wouldn't even import, so it scored as a fail. The right answer was sitting there and my harness was discarding it over whitespace.&lt;/p&gt;

&lt;p&gt;The fix - only when the output fails to parse, ask the model to re-indent its own code, logic untouched. That one change took my score from 64 to 76 out of 100 - and the gain held on the 50 problems I'd never tuned against (31 to 38), which is the half I trust.&lt;/p&gt;
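
&lt;p&gt;A minimal sketch of that gate, assuming the output is Python. This is an illustration, not the harness's actual code: &lt;code&gt;reindent&lt;/code&gt; is a hypothetical callable standing in for the model call, and keeping the repair only when it parses is my assumption.&lt;/p&gt;

```python
import ast


def parses(source):
    """Return True when the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # IndentationError is a subclass of SyntaxError, so it lands here too.
        return False


def repair_if_needed(source, reindent):
    """Call reindent (a stand-in for the model) only when parsing fails."""
    if parses(source):
        return source  # valid output is never touched
    candidate = reindent(source)
    # Assumption: accept the repair only if it now parses; else keep the original.
    return candidate if parses(candidate) else source
```

&lt;p&gt;The point of the gate is that a solution that already parses never goes back through the model, so the repair step can't damage an answer that was fine.&lt;/p&gt;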

&lt;p&gt;If you take one thing from this post - before you conclude a small model can't do something, check whether your harness is throwing away the times it did.&lt;/p&gt;

&lt;h2&gt;
  
  
  Never let a small model decide what to do. Only what to write.
&lt;/h2&gt;

&lt;p&gt;I watched the 2B fail a multi-step task in a way that changed how I build. It wasn't that it couldn't write the fix - it could. Its plan just never included the fix step. It planned around the actual work.&lt;/p&gt;

&lt;p&gt;For a small model, open-ended planning ("what steps should I take?") is close to the least reliable thing you can ask. Filling a bounded blank ("make this stubbed function pass this test") is close to the most reliable. So I stopped asking it to plan. The control flow is now a deterministic program and the model only fills slots. On my multi-step scenarios that took it from 2/3 to 3/3. Three scenarios, so I'm not calling that statistics - but the failure mode was clear and reproducible.&lt;/p&gt;

&lt;p&gt;Related rule that's now non-negotiable - the test exit code is the only judge. The model saying "looks good" counts for nothing.&lt;/p&gt;
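
&lt;p&gt;Roughly, the shape looks like the sketch below. It is not the real harness: &lt;code&gt;write_slot&lt;/code&gt; is a hypothetical stand-in for the model call, and the step tuple layout, retry count and timeout are assumptions made for illustration. The loop owns the control flow, the model only returns code for one slot, and a zero exit code from the test run is the only thing that counts as success.&lt;/p&gt;

```python
import os
import subprocess
import sys
import tempfile


def exit_code_ok(candidate, test_source, timeout_s=30):
    """Run candidate plus test in a fresh interpreter; only exit code 0 passes."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "slot_check.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(candidate + "\n" + test_source + "\n")
        try:
            done = subprocess.run([sys.executable, path],
                                  capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # a hang is a failure, whatever the model claims
    return done.returncode == 0


def fill_slots(steps, write_slot, max_attempts=3):
    """Fixed plan: each step is (name, prompt, test_source); the model only writes."""
    results = {}
    for name, prompt, test_source in steps:
        results[name] = None  # stays None if no attempt clears the test
        for _ in range(max_attempts):
            candidate = write_slot(prompt)
            if exit_code_ok(candidate, test_source):
                results[name] = candidate
                break
    return results
```

&lt;p&gt;Nothing in this loop asks the model what to do next or whether its answer is good, which is the whole idea.&lt;/p&gt;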

&lt;h2&gt;
  
  
  Small models are decent writers and terrible judges
&lt;/h2&gt;

&lt;p&gt;I tried adding a review step - the model checks its own passing solution against the spec and revises if it finds a gap. Standard self-reflection stuff, everyone does it.&lt;/p&gt;

&lt;p&gt;It made things worse. The 2B took a solution that passed its tests, declared there was a gap, and rewrote it into one that failed. The same review-then-commit pattern works fine when I run it with a large model. So "model as judge" isn't a pattern that's good or bad - it has a capability threshold, and a 2B is below it. I haven't seen that stated plainly anywhere, probably because almost nobody runs the review pattern on models this small and measures what happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Most of my good ideas were wrong
&lt;/h2&gt;

&lt;p&gt;Things that did nothing or made it worse, on held-out problems - more context (flat), few-shot examples (zero-shot beat it), retrieval-augmented examples (flat), best-of-N sampling (pure noise). At one point run-to-run noise made a genuinely net-negative prompt change look like a +6% win. That scared me into building a deterministic, temp-0, held-out eval before touching anything else. Cheap insurance. Every claim in this post survived it; most of my ideas didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Then I hit the real wall
&lt;/h2&gt;

&lt;p&gt;HumanEval-style single functions are the easy tier - my harness now does well there (and honestly, those benchmarks are probably in every model's training data anyway, so I only trust the paired deltas, not the absolute scores).&lt;/p&gt;

&lt;p&gt;Real repositories are a different sport. I built a commit-replay eval - take a real project's git history, keep only commits where the repo's own tests go red-to-green, and ask the harness to reproduce the change from the commit message alone. Test hidden, no leakage, scored in Docker.&lt;/p&gt;
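
&lt;p&gt;One way to read the red-to-green filter, as a sketch under stated assumptions: a commit is kept when its own tests fail against the parent's code and pass against its own code. &lt;code&gt;tests_pass&lt;/code&gt; is a hypothetical callable (in practice it would check out code and run the suite in a container); the real eval may define the check differently.&lt;/p&gt;

```python
def red_to_green_commits(history, tests_pass):
    """Filter commits whose tests go from failing to passing.

    history: commit ids, oldest first.
    tests_pass(code_rev, tests_rev): hypothetical callable returning True when
    the tests taken from tests_rev pass against the code at code_rev.
    """
    kept = []
    for parent, commit in zip(history, history[1:]):
        red_before = not tests_pass(parent, commit)  # new tests fail on old code
        green_after = tests_pass(commit, commit)     # and pass on the new code
        if red_before and green_after:
            kept.append(commit)
    return kept
```

&lt;p&gt;Commits that add no test, or whose tests already pass on the parent, drop out here, which is why so few survive the filter.&lt;/p&gt;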

&lt;p&gt;Mining one library's last 400 commits left 37 that were cleanly checkable this way (most commits don't come with a test that pins the change - that filter ratio was a finding in itself). One-shot result - 1 of 37, call it 3%. A second repo came in at 0 for 11, so it's not a quirk of one codebase. After a structural fix (apply every function the commit changed, not just the first) - 4 of 37, about 11%. Then a structured spec-first flow - the model writes a behavior spec from the intent, then its own tests from the spec, then code against those tests, with the real oracle still hidden - took it to 6 of 37, about 16%. On seven of the hard cases I ran a sampling probe - twenty samples each at fair temperature, zero correct. The right answer isn't in the model's distribution at all. That's not a selection problem or a prompting problem. It's a genuine generation wall.&lt;/p&gt;

&lt;p&gt;That gap - 80%+ on single functions, ~10% on real commits - is the actual frontier for small models, and I don't see anyone publishing it honestly. (If anything, benchmark contamination inflates the first number, which makes the real gap wider.)&lt;/p&gt;

&lt;h2&gt;
  
  
  So now it's a multi-model system
&lt;/h2&gt;

&lt;p&gt;If one 2B has a wall, maybe several small models with different walls can cover for each other. I added Qwen's 3B coder and profiled both per problem class. On standalone function generation Qwen genuinely beats Gemma - 65% vs 48% on MBPP (48% is Gemma's best mode; 25% without its reasoning gate). I used MBPP for this specifically after checking that the edge wasn't just HumanEval contamination. Routing that class to Qwen is the first clean multi-model win.&lt;/p&gt;

&lt;p&gt;But here's the finding that matters more - on the hard repo class, Qwen fails the exact same problems Gemma does. Two similar models have correlated failures. Adding a second similar model buys you nothing on the wall - you need models that are actually different, not just more of them. I'm auditioning Phi-4 next for exactly that reason.&lt;/p&gt;

&lt;p&gt;And selection is deterministic - the test gate picks the winner. Never a model judging another model - see above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I'm doing this
&lt;/h2&gt;

&lt;p&gt;Partly because budgets come due. We're all building on subsidized inference, and when that ends, "cheapest model that clears the bar" becomes a real engineering discipline. Partly sovereignty - a future where every small team rents cognition from three companies isn't the future I want. Hardware you can buy for a few hundred dollars, running models you own, is the alternative - if we can prove it's good enough.&lt;/p&gt;

&lt;p&gt;But mostly because mapping the limit is the interesting part. Not "can a tiny model match a frontier model" (it can't, in general) but - which parts of real development collapse into work a tiny model can do inside the right harness, and which parts are genuinely out of reach? I'm building that map, model by model, and I'll keep publishing what I find - including the failures, which so far have taught me more than the wins.&lt;/p&gt;

&lt;p&gt;The code isn't public yet - I want the first version of the model map done before I open it up. If this is your kind of problem, follow along, or tell me what I'm getting wrong. I'd genuinely like to know.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
