Last post we stood up Ollama on the RTX 5090, pulled a stack of models, and wired them into our coding workflow. The whole time there was an obvious question hanging over it: are local models actually good enough?
Not good enough in the abstract benchmarks-on-a-leaderboard sense. Good enough for the thing we're journaling: vibe coding. Specifically, can a model running on consumer hardware in my homelab produce code that's as correct, as fast, and as complete as what comes back from Anthropic's cloud?
We built a benchmark to find out.
The Setup
Six models, one prompt, no second chances.
Cloud (Anthropic API):
- Sonnet 4.6 (claude-sonnet-4-20250514)
- Opus 4.6 (claude-opus-4-20250514)
Local (Ollama on RTX 5090, 32 GB VRAM):
- Codestral 22B (codestral:22b)
- DeepSeek R1 14B (deepseek-r1:14b)
- Devstral (devstral:latest)
- Qwen 3.5B MoE (qwen3.5:35b-a3b)
The prompt was intentionally straightforward: build a Python CLI todo app with SQLite persistence, CRUD commands (add, list, complete, delete), timestamps, pretty output, error handling, and a __main__ block. The kind of task that shows up in real work. A simple "write a small, complete program."
Every model got the exact same prompt with the instruction: "Respond with ONLY the Python code, no explanation."
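For context, here's roughly how each request went out. This is a minimal sketch rather than the actual benchmark harness: the prompt text is paraphrased from the description above, and timing and error handling are omitted.

```python
import requests
import anthropic

PROMPT = (
    "Build a Python CLI todo app with SQLite persistence, commands "
    "(add, list, complete, delete), timestamps, pretty output, error handling, "
    "and a __main__ block. Respond with ONLY the Python code, no explanation."
)

def run_cloud(model: str) -> str:
    """Send the prompt to the Anthropic API and return the generated code."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model=model,
        max_tokens=8192,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return msg.content[0].text

def run_local(model: str) -> str:
    """Send the prompt to a local Ollama server and return the generated code."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```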
We measured:
- Time to first token (TTFT): how long before output starts streaming
- Total generation time: wall clock from request to last token
- Output tokens: how much the model wrote
- Tokens per second: raw generation throughput
- Validation: does it parse, does it have all the features, does it actually run through a functional test suite of 7 operations (add two todos, list, complete one, list again, delete one, list again); a sketch of that test pass is below
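The functional suite is the part that ends up mattering most, so here's a minimal sketch of it: save a model's output as todo.py, run the seven operations as subprocesses, and count clean exits. The helper name and scoring are ours, not from any library; anything that waits on input() fails fast because stdin is closed.

```python
import os
import subprocess
import sys
import tempfile

# The seven operations: add two todos, list, complete one, list again, delete one, list again.
OPERATIONS = [
    ["add", "Buy groceries"],
    ["add", "Write the benchmark post"],
    ["list"],
    ["complete", "1"],
    ["list"],
    ["delete", "2"],
    ["list"],
]

def functional_score(script_path: str) -> int:
    """Run the seven CLI operations against a generated todo.py and count clean exits."""
    script_path = os.path.abspath(script_path)
    passed = 0
    with tempfile.TemporaryDirectory() as workdir:  # each model gets a fresh todos.db
        for args in OPERATIONS:
            try:
                result = subprocess.run(
                    [sys.executable, script_path, *args],
                    cwd=workdir,
                    stdin=subprocess.DEVNULL,   # interactive menu apps hit EOFError here
                    capture_output=True,
                    text=True,
                    timeout=15,
                )
            except subprocess.TimeoutExpired:
                continue                        # hung waiting for input; counts as a failure
            if result.returncode == 0:
                passed += 1
    return passed
```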
The Results: Performance
| Model | Type | TTFT | Total Time | Output Tokens | Tok/s |
|---|---|---|---|---|---|
| Sonnet 4.6 | Cloud | 0.87s | 14.89s | 1,461 | 104.2 |
| Opus 4.6 | Cloud | 1.23s | 19.06s | 1,324 | 74.3 |
| Codestral 22B | Local | 15.81s | 22.11s | 620 | 98.5 |
| DeepSeek R1 | Local | 11.74s | 20.64s | 1,707 | 191.7 |
| Devstral | Local | 2.24s | 10.26s | 723 | 90.2 |
| Qwen 3.5B | Local | 28.20s | 30.91s | 4,096 | 1,510.2 |
A few things jump out immediately. Devstral finished faster than every other model, cloud or local. Qwen's tokens-per-second number is absurd. And DeepSeek R1 produced the most tokens despite its actual code being roughly the same length as everyone else's (more on why in a minute).
The Results: Quality
Performance doesn't matter if the code is wrong. Here's how each model scored:
| Model | Syntax Valid | Features (X/10) | Functional (X/7) | Score |
|---|---|---|---|---|
| Sonnet 4.6 | Yes | 10/10 | 7/7 | 100 |
| Opus 4.6 | Yes | 10/10 | 7/7 | 100 |
| Devstral | Yes | 10/10 | 7/7 | 100 |
| Codestral 22B | Yes | 10/10 | 0/7 | 60 |
| DeepSeek R1 | Yes | 10/10 | 0/7 | 60 |
| Qwen 3.5B | No | 7/10 | 0/7 | 28 |
Three perfect scores. Two models that wrote valid code that didn't pass functional tests. One that didn't even produce valid Python.
Let's talk about what happened.
What Went Wrong (and Right)
The Interactive Menu Problem
Codestral 22B and DeepSeek R1 both scored 10/10 on features. Their code had SQLite, all four CRUD operations, timestamps, completion tracking, error handling, a main block, and pretty output. On paper, they nailed it.
The problem: both interpreted "Commands: add, list, complete, delete" as an interactive menu application. They built while True loops with input() prompts instead of CLI argument parsers.
Codestral's approach:
while True:
action = input("Enter a command (add, list, complete, delete): ")
DeepSeek R1 went even further, building an entire menu system:
print("Todo App Menu:")
print(" ---")
print("add - Add a new todo ")
print("list - List all todos ")
cmd = input("Enter command: ").strip().lower()
Both are perfectly valid interpretations of "commands." Both produced clean, working code. But our automated test suite calls the script with command-line arguments (python todo.py add "Buy groceries"), not interactive input. The scripts immediately hit EOFError: EOF when reading a line because there's no stdin to read from.
This is arguably a prompt clarity issue, not a model quality issue. If the prompt had said "using argparse" or "using sys.argv," both models would have nailed it. But the three models that scored 100 all inferred CLI arguments without being told, which is the more common pattern for "command-line app" in the training data.
The Token Limit Trap
Qwen 3.5B is fascinating and frustrating in equal measure.
That 1,510 tokens-per-second number is real. The model uses a Mixture of Experts (MoE) architecture: 35 billion total parameters, but only ~3 billion active per token. The RTX 5090 tears through it. In pure generation speed, nothing else comes close.
But it hit the 4,096 output token limit mid-f-string:
print(f"{'ID':<5} {'Status':<8} {'Title':<40} {'Created At'}")
print('-' * 70)
for row in rows:
id_, title, created_at, completed = row
status = "[X]" if completed else "[ ]"
print(f"{
That's it. The code cuts off right there. No closing quote, no remaining functions, no main block. The syntax is invalid. The features for complete, delete, and __main__ are missing because the model never got to write them.
The speed is meaningless if the output is incomplete. The lesson: always set generous max_tokens for code generation tasks. A 4,096 limit that's fine for chat responses will absolutely truncate a complete program. We should have set 8,192 or higher. That's on us.
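The fix is boring: set the output cap explicitly, on both sides, before you benchmark. A sketch using the parameter each API exposes for this (max_tokens for Anthropic, num_predict in Ollama's options); the prompt placeholder is ours:

```python
import requests
import anthropic

PROMPT = "..."  # the same benchmark prompt as above

# Anthropic: max_tokens is the hard per-request output cap.
client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=8192,  # roomy enough for a complete small program
    messages=[{"role": "user", "content": PROMPT}],
)

# Ollama: num_predict plays the same role (-1 removes the cap entirely).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:35b-a3b",
        "prompt": PROMPT,
        "stream": False,
        "options": {"num_predict": 8192},
    },
    timeout=600,
)
```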
DeepSeek R1's Thinking Tax
DeepSeek R1 produced 1,707 output tokens, the most of any model, but its actual code was only 156 lines. Where did the extra tokens go?
Into <think> blocks. DeepSeek R1 is a reasoning model. Before writing code, it spends tokens working through the problem:
"Let me think about how to structure this... I need SQLite for persistence... I'll use a class-based approach with a menu system..."
This is genuinely useful for hard debugging problems or complex architectural decisions. But for straightforward code generation where the answer is obvious, it's wasted compute. You're paying (in time and tokens) for the model to reason through something it could just write directly.
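If you keep a reasoning model in the rotation anyway, strip the reasoning before you validate the output. A minimal sketch, assuming the reasoning arrives wrapped in <think>...</think> tags the way DeepSeek R1's did in this run:

```python
import re

def strip_reasoning(raw: str) -> str:
    """Remove <think>...</think> blocks that reasoning models emit before the answer."""
    return re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

def extract_code(raw: str) -> str:
    """Strip reasoning, then peel off a markdown fence if the model added one anyway."""
    text = strip_reasoning(raw)
    match = re.search(r"```(?:python)?\n(.*?)```", text, flags=re.DOTALL)
    return match.group(1).strip() if match else text
```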
Devstral's Quiet Dominance
The standout result of the entire benchmark. Devstral is a 24B parameter model from Mistral, purpose-built for coding tasks. On paper it's smaller than some of the competition. In practice:
- Fastest total time: 10.26 seconds, beating even the cloud models
- Best local TTFT: 2.24 seconds, nearly as fast as cloud cold-start
- Perfect score: 100/100 on quality
- Clean architecture: argparse-based CLI, exactly what the test expected
It didn't overthink it. It didn't build a menu system. It didn't run out of tokens. It just wrote a clean, correct, well-structured todo app and moved on.
Code Comparison: The Three Perfect Scores
All three 100-score models (Sonnet, Opus, Devstral) used argument-based CLI patterns, but their implementations differ in interesting ways.
Sonnet 4.6 went with argparse and a class-based design. 149 lines. Full docstrings, type hints, and emoji-rich output with status indicators:
class TodoApp:
def __init__(self, db_path: str = "todos.db"):
self.db_path = db_path
self.init_database()
def add_todo(self, title: str):
# ...
print(f"β
Added todo #{todo_id}: {title.strip()}")
Opus 4.6 also used a class, but with sys.argv parsing instead of argparse. 157 lines. It used sqlite3.Row for named column access and a manual usage printer. More defensive, too, with explicit connection closing in a finally block:
class TodoApp:
def __init__(self, db_path: str = "todos.db"):
self.conn = sqlite3.connect(db_path)
self.conn.row_factory = sqlite3.Row
self.create_table()
Devstral took the most minimal approach. 99 lines. Flat functions instead of a class, argparse with subparsers, CURRENT_TIMESTAMP in SQL instead of Python-side datetime generation. No emoji, no decorations, just clean output:
def list_todos():
# ...
for row in rows:
status = "[x]" if row[3] else "[ ]"
print(f"{row[0]:<3} {status} {row[1]} (created at: {row[2]})")
The style differences are telling. Sonnet writes like a senior developer who cares about DX. Opus writes like someone who's been burned by unclosed connections before. Devstral writes like someone who bills by the hour and values conciseness. All three approaches are correct. Devstral's is the one I'd actually want in a codebase.
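For reference, the three complete listings follow, in the order discussed. First, Sonnet 4.6's full output: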
import sqlite3
import argparse
import datetime
from typing import List, Tuple
class TodoApp:
def __init__(self, db_path: str = "todos.db"):
self.db_path = db_path
self.init_database()
def init_database(self):
"""Initialize the SQLite database and create the todos table if it doesn't exist."""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS todos (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
created_at TEXT NOT NULL,
completed INTEGER NOT NULL DEFAULT 0
)
""")
conn.commit()
conn.close()
def add_todo(self, title: str):
"""Add a new todo item."""
if not title.strip():
print("β Error: Todo title cannot be empty")
return
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
created_at = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
cursor.execute(
"INSERT INTO todos (title, created_at, completed) VALUES (?, ?, 0)",
(title.strip(), created_at)
)
conn.commit()
todo_id = cursor.lastrowid
conn.close()
print(f"β
Added todo #{todo_id}: {title.strip()}")
def list_todos(self):
"""List all todo items."""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
cursor.execute("SELECT id, title, created_at, completed FROM todos ORDER BY id")
todos = cursor.fetchall()
conn.close()
if not todos:
print("π No todos found. Add some with: python todo.py add \"Your task\"")
return
print("\nπ Your Todos:")
print("-" * 60)
for todo_id, title, created_at, completed in todos:
status = "β" if completed else "β"
status_text = "DONE" if completed else "TODO"
print(f"{status} #{todo_id:<3} [{status_text}] {title}")
print(f" π
Created: {created_at}")
print("-" * 60)
print(f"Total: {len(todos)} todos")
def complete_todo(self, todo_id: int):
"""Mark a todo as completed."""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
cursor.execute("SELECT id, title, completed FROM todos WHERE id = ?", (todo_id,))
todo = cursor.fetchone()
if not todo:
print(f"β Error: Todo #{todo_id} not found")
conn.close()
return
if todo[2]:
print(f"βΉοΈ Todo #{todo_id} is already completed")
conn.close()
return
cursor.execute("UPDATE todos SET completed = 1 WHERE id = ?", (todo_id,))
conn.commit()
conn.close()
print(f"β
Completed todo #{todo_id}: {todo[1]}")
def delete_todo(self, todo_id: int):
"""Delete a todo item."""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
cursor.execute("SELECT id, title FROM todos WHERE id = ?", (todo_id,))
todo = cursor.fetchone()
if not todo:
print(f"β Error: Todo #{todo_id} not found")
conn.close()
return
cursor.execute("DELETE FROM todos WHERE id = ?", (todo_id,))
conn.commit()
conn.close()
print(f"ποΈ Deleted todo #{todo_id}: {todo[1]}")
def main():
app = TodoApp()
parser = argparse.ArgumentParser(description="A simple command-line todo application")
subparsers = parser.add_subparsers(dest="command", help="Available commands")
add_parser = subparsers.add_parser("add", help="Add a new todo")
add_parser.add_argument("title", help="Todo title")
list_parser = subparsers.add_parser("list", help="List all todos")
complete_parser = subparsers.add_parser("complete", help="Mark a todo as completed")
complete_parser.add_argument("id", type=int, help="Todo ID to complete")
delete_parser = subparsers.add_parser("delete", help="Delete a todo")
delete_parser.add_argument("id", type=int, help="Todo ID to delete")
args = parser.parse_args()
try:
if args.command == "add":
app.add_todo(args.title)
elif args.command == "list":
app.list_todos()
elif args.command == "complete":
app.complete_todo(args.id)
elif args.command == "delete":
app.delete_todo(args.id)
else:
parser.print_help()
except Exception as e:
print(f"β Error: {str(e)}")
if __name__ == "__main__":
main()
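Opus 4.6's full output: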
import sqlite3
import sys
from datetime import datetime
from typing import List, Tuple
class TodoApp:
def __init__(self, db_path: str = "todos.db"):
self.conn = sqlite3.connect(db_path)
self.conn.row_factory = sqlite3.Row
self.create_table()
def create_table(self):
self.conn.execute("""
CREATE TABLE IF NOT EXISTS todos (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
created_at TIMESTAMP NOT NULL,
completed BOOLEAN NOT NULL DEFAULT 0
)
""")
self.conn.commit()
def add(self, title: str) -> None:
if not title.strip():
print("Error: Todo title cannot be empty")
return
try:
self.conn.execute(
"INSERT INTO todos (title, created_at) VALUES (?, ?)",
(title, datetime.now())
)
self.conn.commit()
print(f"β Added todo: '{title}'")
except Exception as e:
print(f"Error: Failed to add todo - {e}")
def list(self) -> None:
try:
cursor = self.conn.execute(
"SELECT * FROM todos ORDER BY created_at DESC"
)
todos = cursor.fetchall()
if not todos:
print("No todos found. Add one with 'add <title>'")
return
print("\n" + "="*60)
print("TODO LIST")
print("="*60)
for todo in todos:
status = "β" if todo["completed"] else "β"
created = datetime.fromisoformat(todo["created_at"]).strftime("%Y-%m-%d %H:%M")
print(f"{status} [{todo['id']}] {todo['title']}")
print(f" Created: {created}")
print("-"*60)
except Exception as e:
print(f"Error: Failed to list todos - {e}")
def complete(self, todo_id: str) -> None:
try:
todo_id = int(todo_id)
except ValueError:
print("Error: Invalid todo ID. Must be a number.")
return
try:
cursor = self.conn.execute(
"UPDATE todos SET completed = 1 WHERE id = ?",
(todo_id,)
)
if cursor.rowcount == 0:
print(f"Error: Todo with ID {todo_id} not found")
else:
self.conn.commit()
print(f"β Completed todo {todo_id}")
except Exception as e:
print(f"Error: Failed to complete todo - {e}")
def delete(self, todo_id: str) -> None:
try:
todo_id = int(todo_id)
except ValueError:
print("Error: Invalid todo ID. Must be a number.")
return
try:
cursor = self.conn.execute(
"DELETE FROM todos WHERE id = ?",
(todo_id,)
)
if cursor.rowcount == 0:
print(f"Error: Todo with ID {todo_id} not found")
else:
self.conn.commit()
print(f"β Deleted todo {todo_id}")
except Exception as e:
print(f"Error: Failed to delete todo - {e}")
def close(self):
self.conn.close()
def print_usage():
print("""
Usage:
python todo.py add <title> - Add a new todo
python todo.py list - List all todos
python todo.py complete <id> - Mark a todo as completed
python todo.py delete <id> - Delete a todo
""")
if __name__ == "__main__":
app = TodoApp()
if len(sys.argv) < 2:
print_usage()
sys.exit(1)
command = sys.argv[1].lower()
try:
if command == "add":
if len(sys.argv) < 3:
print("Error: Please provide a todo title")
print("Usage: python todo.py add <title>")
else:
title = " ".join(sys.argv[2:])
app.add(title)
elif command == "list":
app.list()
elif command == "complete":
if len(sys.argv) < 3:
print("Error: Please provide a todo ID")
print("Usage: python todo.py complete <id>")
else:
app.complete(sys.argv[2])
elif command == "delete":
if len(sys.argv) < 3:
print("Error: Please provide a todo ID")
print("Usage: python todo.py delete <id>")
else:
app.delete(sys.argv[2])
else:
print(f"Error: Unknown command '{command}'")
print_usage()
sys.exit(1)
finally:
app.close()
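Devstral's full output: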
import sqlite3
from datetime import datetime
import argparse
import sys
DATABASE = 'todos.db'
def init_db():
conn = sqlite3.connect(DATABASE)
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS todos (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
completed BOOLEAN NOT NULL CHECK (completed IN (0, 1))
)''')
conn.commit()
conn.close()
def add_todo(title):
conn = sqlite3.connect(DATABASE)
c = conn.cursor()
c.execute("INSERT INTO todos (title, completed) VALUES (?, ?)", (title, False))
conn.commit()
conn.close()
def list_todos():
conn = sqlite3.connect(DATABASE)
c = conn.cursor()
c.execute("SELECT id, title, created_at, completed FROM todos ORDER BY completed, created_at")
rows = c.fetchall()
conn.close()
for row in rows:
status = "[x]" if row[3] else "[ ]"
print(f"{row[0]:<3} {status} {row[1]} (created at: {row[2]})")
def complete_todo(id):
conn = sqlite3.connect(DATABASE)
c = conn.cursor()
c.execute("UPDATE todos SET completed = 1 WHERE id = ?", (id,))
rows_affected = c.rowcount
conn.commit()
conn.close()
if rows_affected == 0:
raise ValueError(f"Todo with ID {id} not found.")
def delete_todo(id):
conn = sqlite3.connect(DATABASE)
c = conn.cursor()
c.execute("DELETE FROM todos WHERE id = ?", (id,))
rows_affected = c.rowcount
conn.commit()
conn.close()
if rows_affected == 0:
raise ValueError(f"Todo with ID {id} not found.")
def main():
parser = argparse.ArgumentParser(description='CLI todo application')
subparsers = parser.add_subparsers(dest='command')
add_parser = subparsers.add_parser('add', help='Add a new todo')
add_parser.add_argument('title', type=str, help='Todo title')
list_parser = subparsers.add_parser('list', help='List all todos')
complete_parser = subparsers.add_parser('complete', help='Mark a todo as completed')
complete_parser.add_argument('id', type=int, help='TODO ID to mark as completed')
delete_parser = subparsers.add_parser('delete', help='Delete a todo')
delete_parser.add_argument('id', type=int, help='TODO ID to delete')
args = parser.parse_args()
init_db()
try:
if args.command == 'add':
add_todo(args.title)
elif args.command == 'list':
list_todos()
elif args.command == 'complete':
complete_todo(args.id)
elif args.command == 'delete':
delete_todo(args.id)
else:
parser.print_help()
except Exception as e:
print(f"Error: {e}")
sys.exit(1)
if __name__ == '__main__':
main()
Speed Analysis
The performance numbers tell two very different stories depending on what you care about.
For interactive chat and streaming, TTFT is what matters. Cloud models dominated here. Sonnet 4.6 started streaming in 0.87 seconds. Opus in 1.23 seconds. You ask a question, you immediately see output. That responsiveness is a big part of why cloud models feel fast even when their total generation time is longer.
Local models have a fundamentally different cost model. TTFT includes model loading time, and on first request after a cold start, that loading time is significant:
- Devstral: 2.24s TTFT (best local, model stays warm in VRAM)
- DeepSeek R1: 11.74s (14B params loading into VRAM)
- Codestral 22B: 15.81s (22B params, larger model footprint)
- Qwen 3.5B: 28.20s (35B total params, 23 GB model loading from disk into VRAM despite only 3B active)
Qwen's 28-second TTFT is brutal for interactive use. You type a prompt and wait half a minute before anything appears. The MoE architecture means the full model weight file is enormous even though inference is fast once loaded.
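Devstral's 2.24-second TTFT is what a warm model looks like; the others paid the cold-load price. If you know a coding session is coming, you can pre-load the model and tell Ollama to keep it resident. A sketch using the keep_alive request option (the same behavior can be set server-wide with the OLLAMA_KEEP_ALIVE environment variable):

```python
import requests

# An empty prompt just loads the model; keep_alive keeps the weights resident in VRAM
# afterwards, so the next real request skips the multi-second load from disk.
requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "devstral:latest",
        "prompt": "",
        "keep_alive": "30m",  # or -1 to keep it loaded until the server stops
    },
    timeout=120,
)
```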
For batch processing and code generation, total time and throughput matter more than TTFT. And here, the picture flips. Devstral at 10.26 seconds total beat both cloud models. Once the local models are loaded and generating, their token throughput is competitive:
| Model | Tok/s | Context |
|---|---|---|
| Qwen 3.5B | 1,510.2 | MoE architecture, 3B active params |
| DeepSeek R1 | 191.7 | Includes reasoning tokens |
| Sonnet 4.6 | 104.2 | Cloud, shared infrastructure |
| Codestral 22B | 98.5 | Full 22B model on single GPU |
| Devstral | 90.2 | 24B model, balanced speed/quality |
| Opus 4.6 | 74.3 | Cloud, larger model |
Devstral found the sweet spot: fast enough TTFT to feel responsive, fast enough generation to beat the cloud on wall-clock time, and high enough quality to score perfectly. It's the model that made me stop thinking of local inference as a compromise.
The Verdict
For production coding tasks: Sonnet 4.6 or Devstral. Sonnet if you're already in the Anthropic ecosystem and want sub-second TTFT. Devstral if you want the same quality with zero API costs, zero rate limits, and total data privacy. Both scored 100. Devstral was actually faster end-to-end.
Opus 4.6 is capable but slower and more expensive for no quality gain on this task. Its strengths show on harder problems: multi-file refactors, complex debugging, architectural decisions. For straightforward code generation, you're paying a premium for capability you don't need.
Codestral 22B and DeepSeek R1 aren't bad models. They wrote valid, working code. The "failure" was a prompt interpretation issue that a single clarifying word would have fixed. In a conversational coding session where you can follow up, both would have corrected course immediately.
Qwen 3.5B is a speed demon trapped by token limits. At 1,510 tok/s it's the fastest generator by an order of magnitude, but that speed is wasted if you cap output too low. With proper max_tokens settings and the right tasks (short functions, completions, refactors), it could be the best option for high-throughput local work. We'll retest with higher limits.
The real takeaway isn't about which model "won." It's that the prompt matters as much as the model. Two models scored 60 because of a single ambiguous word in the prompt. One model scored 28 because of a configuration parameter. The gap between cloud and local quality has effectively closed for focused coding tasks. The remaining differences are in speed characteristics, token economics, and how forgiving the model is when your prompt isn't perfectly specific.
Local LLMs on consumer hardware aren't a compromise anymore. They're a legitimate option. Devstral proved it.
What's Next: Round 2
This was a useful first benchmark, but it was also a simple one. A single-file todo app with a clear spec is the kind of task where every model should do well. The interesting question is what happens when you make it harder.
Round 2 will use a more complex task: multi-file, with tests, with ambiguous requirements that force the model to make architectural decisions. We'll also adjust based on what we learned here. The prompt will be more explicit (no more "CLI arguments or interactive menu" ambiguity that tripped up two models), and we'll give every model a larger context window and higher token limits so no one gets cut off mid-line.
We're also adding two models that came as requests from the Coder dev team:
- Kimi K2.6 (Moonshot AI). A 1T-parameter MoE model with 32B active parameters and 256K context. It's getting strong benchmark scores and has native tool-calling support. The catch: even the most aggressively quantized version needs ~240 GB of memory, which is well beyond what the homelab can handle locally. We'll need to test this one via API.
- Gemma 4 (Google). We need to research the available sizes and quantizations to see what fits on 32 GB of VRAM. If there's a version in the 14B-27B range, it could slot in alongside Devstral and Qwen as another local contender.
Both additions will be interesting tests of whether the "local models are good enough" conclusion holds with a harder prompt, and whether the newer model generation has closed the gap further.
By the Numbers
- 6 models benchmarked
- 1 prompt, identical across all models
- 3 perfect scores (Sonnet 4.6, Opus 4.6, Devstral)
- 2 models that built the wrong kind of app
- 1 model that ran out of tokens mid-f-string
- 10.26 seconds for Devstral to write a complete, working todo app
- 1,510 tokens per second from Qwen 3.5B (fastest local generation)
- 0.87 seconds for Sonnet 4.6's first token (fastest TTFT)
- 28.2 seconds for Qwen's first token (slowest TTFT)
- 4,096 token limit that killed an otherwise promising run
- 32 GB of VRAM making all of this possible on a single GPU
- 0 API costs for the three local models