Matt Macosko

Posted on • Originally published at nicedreamzwholesale.com

HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off

The M5 Max MacBook Pro with 128 GB of unified memory is the first laptop that can hold a frontier-class coding agent entirely in RAM. No GPU rack. No cloud. No subscription.

I just ran HumanEval on it. Wi-Fi off the entire run.

  • 81.7% pass@1 on the full 164-problem benchmark
  • Qwen 3 Coder 30B-A3B-Instruct (8-bit MLX)
  • 14 minutes wall-clock, $0/month after the model download

YouTube walkthrough (three real problems, code streaming live, tests going green):
https://www.youtube.com/watch?v=muq7VdgxqRk

Why this number matters

The Qwen team didn't publish HumanEval scores for any Qwen3-Coder variant — they consider the benchmark saturated and went straight to agentic ones (SWE-bench Verified, BFCL, Aider-Polyglot). For the 30B variant — the one that actually fits on a laptop — there were no published HumanEval/MBPP numbers. Until this run.

I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample. The pass rate had been stable since n=120. Running all 427 problems was impractical because a few outlier tasks induce very long model responses (10+ minutes each).
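As a rough sanity check on the two percentages, the short Python sketch below recovers the pass counts they imply and estimates how much sampling noise each carries. The counts are inferred from the rounded percentages, not taken from the run logs, and the standard error assumes each problem is an independent pass/fail draw, which is a simplification. The function names are made up for illustration.

```python
import math


def implied_passes(rate_percent: float, n: int) -> list[int]:
    """Pass counts k for which k/n rounds to the reported percentage (one decimal)."""
    return [k for k in range(n + 1) if round(100 * k / n, 1) == rate_percent]


def standard_error(k: int, n: int) -> float:
    """Binomial standard error of the pass rate k/n (assumes independent problems)."""
    p = k / n
    return math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # (label, reported pass@1 in percent, number of problems run)
    for name, rate, n in [("HumanEval", 81.7, 164), ("MBPP sample", 83.3, 168)]:
        for k in implied_passes(rate, n):
            se = standard_error(k, n)
            print(f"{name}: {k}/{n} passed, standard error ~{100 * se:.1f} points")
```

Under those assumptions, each score carries roughly three percentage points of sampling noise, so small gaps between models on benchmarks of this size should be read with care.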

Methodology

Setting       Value
Benchmark     HumanEval, 164 Python tasks (full)
Metric        pass@1 (first attempt only)
Temperature   0.0 (deterministic)
Sampling      single sample per problem, no best-of-N
Execution     Python subprocess, 10s timeout
Hardware      M5 Max MacBook Pro · 128 GB unified memory
Model         Qwen3-Coder-30B-A3B-Instruct-MLX-8bit
Network       Wi-Fi OFF the entire run
Wall clock    14 minutes
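For readers who want to see what this setup looks like in code, here is a minimal Python sketch of a pass@1 loop with one sample per problem, executed in a subprocess with a 10-second timeout. It is not the harness used for the run above. The toy problems and the `fake_generate` stand-in for the model are hypothetical; a real run would plug in the HumanEval tasks and a local model call instead.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

TIMEOUT_SECONDS = 10  # same limit as the Execution row in the table


def passes(solution: str, test: str, timeout: float = TIMEOUT_SECONDS) -> bool:
    """Run solution + test in a fresh Python subprocess; True if it exits with code 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + test + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang or infinite loop counts as a failure
        return result.returncode == 0


def pass_at_1(problems, generate) -> float:
    """Fraction of problems whose single generated sample passes its test."""
    solved = sum(passes(generate(p["prompt"]), p["test"]) for p in problems)
    return solved / len(problems)


# Hypothetical toy tasks standing in for the real benchmark problems.
TOY_PROBLEMS = [
    {"prompt": "def add(a, b):", "test": "assert add(2, 3) == 5"},
    {"prompt": "def double(x):", "test": "assert double(4) == 8"},
]

# Hypothetical stand-in for the model: one canned completion per prompt.
CANNED = {
    "def add(a, b):": "def add(a, b):\n    return a + b",
    "def double(x):": "def double(x):\n    return x + 1",  # deliberately wrong
}


def fake_generate(prompt: str) -> str:
    return CANNED[prompt]


if __name__ == "__main__":
    print(f"pass@1 = {pass_at_1(TOY_PROBLEMS, fake_generate):.1%}")
```

Running each candidate in its own subprocess keeps a crashing or looping solution from taking down the harness. A production setup would usually add stronger sandboxing, since the generated code runs with the user's own permissions.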

For context — Qwen3-Coder 480B's official agentic benchmarks

The Qwen team's published numbers for the 480B flagship, the larger sibling of the 30B model running on this MacBook:

Benchmark                       Qwen3-Coder 480B   Claude Sonnet 4   GPT-4.1
SWE-bench Verified (500-turn)   69.6               70.4
Terminal-Bench                  37.5               35.5              25.3
BFCL-v3                         68.7               73.3              62.9
Aider-Polyglot                  61.8               56.4              52.4

Source: Qwen team's official blog.

Why the offline part matters

If a tool needs the internet, three things are true:

  1. Someone else can read what you sent.
  2. Someone else can charge you for it.
  3. Someone else can take it away.

If the same tool runs locally, none of those are true. That's a different category of software — and for law firms, medical practices, and accountants handling client material, it's the only legal one.

Reproduce it yourself

For law firms, medical practices, and accountants who want help getting this stack running on their own hardware — that's what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.

— matt


Originally published at Marijuana Union.
