Originally published at llmkube.com/blog/m5-max-aider-polyglot-and-finops. Cross-posted here for the dev.to audience.
A 24-hour Aider Polyglot run, a follow-up bench that blew up in interesting ways, and a working $/MTok number from a Kubernetes operator that scrapes Apple Silicon power live. Two open-source PRs landed today to make all of this reproducible on any M-series Mac.
This is a coding-model benchmark on locally-served weights, plus a FinOps story. Every benchmark number traces to results files we can show you. Every cost number traces to a CSV captured by InferCost during the run. The point is the methodology and the tooling; the model rankings are along for the ride.
TL;DR
- Qwen3.6-35B-A3B Q8 (Tongyi Lab, Apache 2.0) hit 62.2% on Aider Polyglot (pass_rate_2, n=225/225) running locally on a MacBook Pro M5 Max via LLMKube's Metal Agent. That places it above Claude Sonnet 4 with 32k thinking budget (61.3%), o1-high (61.7%), DeepSeek R1 original (56.9%), and Claude 3.5 Sonnet (51.6%) on the official Aider leaderboard. It also beats every published Qwen-family entry on the Polyglot board.
- Devstral-Small-2-2512 Q8 (Mistral, Apache 2.0) hit 4% on Aider Polyglot with the diff edit format, 8% with the whole format, and 81.7% pass@1 on HumanEval+ (all 164 problems). Same model. A 20× swing. Benchmark numbers don't transfer across harnesses, and you should never quote a score without naming the harness it came from.
- InferCost ran the whole time. The new Apple Silicon collector (shipped in InferCost v0.3.0) reconciled $0.18/hr against the `apple-m5-max` CostProfile, with InferCost's reading agreeing with the LLMKube agent's direct gauge within a 1.6 W mean delta over the Qwen window. First widely-published $/MTok number for an Apple Silicon LLM workload that traces to a real Prometheus scrape.
- Two releases shipped alongside this post make all of it reproducible on your own Mac: LLMKube v0.7.2 (Apple power gauges via powermetrics, security-hardened sudoers, and a one-command `make install-powermetrics-sudo`) and InferCost v0.3.0 (Metal collector, condition reporting, sample CostProfile).
1. The hardware and what's special about it
The bench machine is a MacBook Pro M5 Max, 2026 model:
| Spec | Value |
|---|---|
| GPU | 40-core integrated, Metal 4 |
| CPU | 18-core (6 P-cores, 12 E-cores) |
| Unified memory | 128 GB |
| Memory bandwidth | 614 GB/s |
| OS | macOS 25.4 (Darwin) |
| Price | About $4,500 fully configured |
Source: Apple newsroom.
The 614 GB/s bandwidth is the constraint that decides everything that follows. For a dense 24B model at Q8, you need to read about 25 GB of weights per generated token, so the upper bound is 614 / 25 = 24.56 t/s; we measured 24 t/s, within 2.3% of the wall. For a MoE like Qwen3.6-35B-A3B, only the ~3 GB of active-expert weights (3B parameters at Q8) are read per token, so the wall is ~200 t/s and you actually get to choose how to spend the bandwidth. That's the whole story behind why MoE feels fast on a Mac.
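The ceiling math is worth doing yourself; here's a throwaway version, with the GB-per-token figures above treated as rough assumptions rather than exact model sizes:

```bash
# Rough decode ceilings from memory bandwidth; GB-per-token figures are approximations.
awk 'BEGIN {
  bw = 614;              # GB/s unified memory bandwidth (M5 Max)
  dense_q8 = 25;         # ~25 GB of weights read per token, dense 24B at Q8
  moe_active_q8 = 3;     # ~3 GB of active-expert weights per token, 35B-A3B at Q8
  printf "dense 24B Q8 ceiling : %.1f t/s\n", bw / dense_q8;
  printf "MoE 3B-active ceiling: ~%.0f t/s\n", bw / moe_active_q8;
}'
```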
Stack: LLMKube v0.7.x with the Metal Agent feature branch from PR #334 cherry-picked in (now main), llama-server from llama.cpp Metal, and a kind cluster on the same host for the K8s control plane. InferCost was running locally via go run ./cmd/main.go, pointed at the LLMKube agent's /metrics endpoint via a new --metal-endpoint flag.
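Concretely, the InferCost side of that setup was a single command; the endpoint here matches the second Metal Agent on port 9091 described in §7, and the flag is the one this post introduces:

```bash
# Run InferCost locally, pointed at the LLMKube Metal Agent's Prometheus endpoint.
go run ./cmd/main.go --metal-endpoint http://localhost:9091/metrics
```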
2. Qwen3.6-35B-A3B Q8 on Aider Polyglot
The Qwen3.6 family includes a dense 27B and an MoE variant at 35B total / 3B active per token. We ran the MoE quantized to Q8_0 (~36 GB on disk, fits comfortably in 128 GB unified memory with room for KV cache and the rest of macOS).
Aider Polyglot is a 225-problem benchmark across C++, Go, Java, JavaScript, Python, and Rust, designed to keep top frontier coding LLMs in the 5-50% range. Each model gets two attempts per problem: a single-shot solve, and a second attempt after seeing the failed test output. The headline metric is pass_rate_2, the percentage of problems that passed all tests within those two attempts.
Aider was driven from inside a Docker container (aider-benchmark image) talking to llama-server via OPENAI_API_BASE=http://host.docker.internal:<port>/v1. Edit format was diff (Aider's standard for capable models). Threads = 4. The model id we passed to LiteLLM was openai/Qwen3.6-35B-A3B-Q8_0.gguf, the basename llama-server reports.
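For concreteness, here's roughly what the invocation looked like from inside the benchmark container. Treat it as a sketch: the run name, port, and `--exercises-dir` value are placeholders, and the exact flag set should be checked against the Aider benchmark README; only the model id, edit format, and thread count are the actual values from this run.

```bash
# Sketch of the Aider Polyglot invocation (placeholders: run name, port, exercises dir).
export OPENAI_API_BASE=http://host.docker.internal:8080/v1   # llama-server on the host
export OPENAI_API_KEY=local-dummy                            # llama-server ignores it; the client wants one

./benchmark/docker.sh        # drops you into the aider-benchmark container
./benchmark/benchmark.py qwen36-35b-a3b-q8-diff \
  --model openai/Qwen3.6-35B-A3B-Q8_0.gguf \
  --edit-format diff \
  --threads 4 \
  --exercises-dir polyglot-benchmark
```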
The full run took 49.9 hours of cumulative inference time, spread across about 24 hours of real time thanks to the 4 benchmark threads, plus a follow-up resume cycle to handle a runaway-reasoning failure mode. More on that in §3.
The headline result
pass_rate_2 = 62.2% (140 of 225), pass_rate_1 = 34.7% (78 of 225).
We verified against the official Aider Polyglot leaderboard YAML, pulled today. Here's where that lands among the published baselines:
| pass_rate_2 | format | model |
|---|---|---|
| 88.0% | diff | gpt-5 (high) |
| 84.9% | diff | o3-pro (high) |
| 81.3% | diff | o3 (high) |
| 79.6% | diff | grok-4 (high) |
| 72.0% | diff | Claude Opus 4 (32k thinking) |
| 71.4% | diff | DeepSeek R1 (0528) |
| 64.9% | diff | Claude 3.7 Sonnet (32k thinking) |
| 64.0% | architect | DeepSeek R1 + Claude 3.5 Sonnet |
| 62.2% | diff | Qwen3.6-35B-A3B Q8 (this run, M5 Max, Apache 2.0, ours) |
| 61.7% | diff | o1-2024-12-17 (high) |
| 61.3% | diff | Claude Sonnet 4 (32k thinking) |
| 60.4% | diff | Claude 3.7 Sonnet (no thinking) |
| 59.6% | diff | Qwen3 235B A22B (no think, Alibaba API) |
| 56.9% | diff | DeepSeek R1 (original) |
| 56.4% | diff | Claude Sonnet 4 (no thinking) |
| 51.6% | diff | Claude 3.5 Sonnet |
| 40.0% | diff | Qwen3 32B |
| 8.0% | diff | Qwen2.5-Coder-32B |
The defensible reads:
- Beats Claude Sonnet 4 with 32k thinking budget by 0.9 percentage points.
- Beats o1-high by 0.5 percentage points.
- Beats DeepSeek R1 original by 5.3 percentage points.
- Beats Claude 3.5 Sonnet by 10.6 percentage points.
- Within 2.7 points of Claude 3.7 Sonnet (32k thinking).
- Strongest open-weights Qwen-family number on the Polyglot leaderboard. Qwen3 32B sat at 40.0%, Qwen3 235B A22B at 59.6%. The 35B-A3B MoE architecture is doing real work for its size, even at Q8.
What we are not claiming: that this beats Opus 4, GPT-5, o3, or DeepSeek V3.2-Exp Reasoner. Those all sit above us on the leaderboard. Qwen3.6 is in the same band as Sonnet 4 thinking, not in the band with o3-high or GPT-5.
Per-language
| Language | n | pass_1 | pass_2 | p2 % | avg min/exercise |
|---|---|---|---|---|---|
| python | 34 | 14 | 25 | 73.5% | 4.8 |
| javascript | 49 | 20 | 35 | 71.4% | 5.3 |
| go | 39 | 16 | 24 | 61.5% | 6.2 |
| rust | 30 | 14 | 17 | 56.7% | 9.2 |
| cpp | 26 | 4 | 14 | 53.8% | 21.7 |
| java | 47 | 10 | 25 | 53.2% | 31.9 |
Two things worth noting. First, Python and JavaScript at ~71-74% put the model in thinking-Sonnet territory on the languages most developers actually use Aider for. Second, Java at 31.9 minutes per exercise on average is inflated by the runaway-reasoning case described next. Strip the outlier and Java's average is in line with C++.
3. The runaway-reasoning failure mode (and the resume that closed it out)
About 21 hours into the run, the container got stuck on a Java exercise that consumed 80 minutes of wall time without writing a new result file or producing meaningful output. The log mtime stayed frozen, the container stayed "Up," and the model was clearly deep in a reasoning loop with no exit strategy. We stopped the container manually at n=223/225 and recorded runaway reasoning as a real failure mode of hybrid-thinking MoE models on agentic harnesses.
The next night, we resumed via Aider's official --cont flag against the same run directory. Two missing exercises (rust/forth and javascript/go-counting) ran in parallel under --threads 4 and completed in about 6 minutes each. Both failed both attempts. Final result: n=225/225, pass_rate_2 = 62.2%.
The headline ticked down by 0.6 percentage points compared to the n=223 partial (62.8% → 62.2%) because the two missing exercises both failed. That's the most honest defense against any "stopped early to lock in a favorable number" critique: completing the run actually hurt us.
If you reproduce this and see a similar hang, kill the container and run with --cont later to fill in the gaps. The complete dataset is healthier than a partial one.
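The resume itself is just the same invocation with `--cont` pointed at the existing run name (same sketch caveats as the invocation above):

```bash
# Re-run against the existing run directory; --cont fills in only the missing exercises.
./benchmark/benchmark.py qwen36-35b-a3b-q8-diff \
  --model openai/Qwen3.6-35B-A3B-Q8_0.gguf \
  --edit-format diff --threads 4 --cont
```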
4. The other thing we wanted to test
With Qwen3.6 in hand, the natural next move was a comparison candidate. The ideal contrast: a dense model purpose-built for agentic coding, not a general-purpose coder.
Devstral-Small-2-24B-Instruct-2512 was the obvious pick. Mistral and All Hands AI co-trained it specifically for software-engineering agents, it's an Apache 2.0 dense 24B with a 256K context window, and Mistral published 68.0% SWE-Bench Verified for it (a real number on a real benchmark). Released November 2025, so about 5 months old at time of writing. The architecture is the new "Ministral 3 with rope-scaling and Scalable-Softmax" stack from Mistral, structurally different from Devstral 1.x.
We deployed it via the same LLMKube + Metal Agent path, kicked off Aider Polyglot with --num-tests 25 (random subset, fits a 4-hour window at Devstral's slower decode speed of ~24 t/s), edit format diff.
Result: pass_rate_2 = 4.0% (1 of 25).
We almost wrote it off as broken. Then we read the Aider results files more carefully:
- 92% of responses were syntactically well-formed diffs.
- Zero exhausted context windows.
- Average 4.4 minutes per exercise (fast, not stuck).
- The model was producing valid-looking edit blocks; they were just semantically wrong.
The model wasn't broken. It was doing what it had been trained to do, which apparently wasn't this.
5. Investigation
Three hypotheses, ordered by what we tried:
Hypothesis 1: The diff format is the problem. Aider supports --edit-format whole (output complete files instead of diffs). Re-ran with whole format on the same 25-exercise subset.
Result: pass_rate_2 = 8.0% (2 of 25). Better, but not by much. Hypothesis weakly supported.
Hypothesis 2: llama.cpp isn't handling Devstral 2's new architecture correctly. Worth checking before declaring the model bad. We ran HumanEval+ via evalplus, pointed at the same llama-server endpoint, with a function-level Python coding harness that doesn't require any agentic tool-call discipline. If llama.cpp's tokenizer or attention implementation was off, we'd see it here.
Result: HumanEval pass@1 = 85.4%, HumanEval+ pass@1 = 81.7% (164 problems, scored in ganler/evalplus Linux container because macOS's setrlimit(RLIMIT_AS) doesn't behave the way evalplus's sandbox expects).
That landed Devstral 2 in the same band as the top open-source 24B coders for function-level Python. Architecture is fine. llama.cpp is fine. The model is genuinely capable.
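For reference, the scoring step looked roughly like the evalplus README's container usage; treat the exact arguments as an approximation and the samples file name as a placeholder (the generation step against llama-server isn't shown here):

```bash
# Score pre-generated completions inside the Linux container to sidestep the macOS RLIMIT_AS issue.
docker run --rm -v "$PWD:/app" ganler/evalplus:latest \
  --dataset humaneval --samples /app/samples.jsonl
```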
Hypothesis 3: The harness is the variable. We re-read Mistral's README:
Devstral 2 can also be used with the following scaffoldings:
- Mistral Vibe (recommended)
- Cline
- Kilo Code
- Claude Code
- OpenHands
- SWE Agent
Aider is not on this list. Devstral 2 was trained on tool-call traces from agentic-coding harnesses that use multi-turn function calls, not Aider's single-prompt-with-diff edit format. The model was producing what its training distribution rewarded; Aider's harness was scoring it on a different distribution entirely.
Mistral itself adds, in the same README:
"we advise everyone to use the Mistral AI API if the model is subpar with local serving"
That's an explicit caveat from the model authors. The 4% wasn't a model failure or a runtime failure. It was a harness-distribution mismatch, exactly the failure mode the README warned about.
6. Same model, three benchmarks, three answers
| Benchmark | Devstral 2 score |
|---|---|
| Aider Polyglot, diff format | 4.0% |
| Aider Polyglot, whole format | 8.0% |
| HumanEval+ (with adversarial tests) | 81.7% |
| HumanEval (base) | 85.4% |
A twenty-fold difference in measured "performance" on the same model, same hardware, same temperature, same week. This is the lesson worth taking away from the entire bench session.
If you publish a single benchmark number for any agentic coding model, you are publishing a story about that model's compatibility with one specific harness, not a story about the model's coding capability. The Devstral 2 4% on Aider does not mean Devstral 2 is bad at coding. The Devstral 2 81.7% on HumanEval+ does not mean Devstral 2 is good at agentic edits in your IDE. They are both true and they describe different things.
If you want to evaluate a coding model, run it through the harness you actually use day to day. If you can't, then quote at least two benchmarks from different parts of the harness landscape (one function-level, one agentic) and let the reader see the spread.
7. InferCost was running the whole time
While the benchmarks were producing accuracy numbers, InferCost was producing the cost numbers. The new Apple Silicon collector (shipped in InferCost v0.3.0) was reconciling the apple-m5-max CostProfile every 30 seconds against the LLMKube Metal Agent's apple_power_combined_watts gauge.
Specifically, two things were running in the background of every benchmark above:
- A second LLMKube Metal Agent on port 9091 with `--apple-power-enabled`, publishing the four new `apple_power_*_watts` Prometheus gauges sourced from a sudo'd `powermetrics` subprocess. Pinned-argv NOPASSWD sudoers entry to keep the privilege grant tight (security audit caught and fixed three findings before merge: argv pinning, bin override rejection, absolute `/usr/bin/sudo` to defeat $PATH attacks).
- InferCost as a local controller, pointed at `:9091/metrics` via the new `--metal-endpoint` CLI flag, reconciling an `apple-m5-max` CostProfile using the new Metal scraper and dispatcher.

Plus a tiny CSV poller that sampled both layers every 60 seconds, writing 388 rows of telemetry across the day.
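The poller itself is nothing clever. Here's a minimal sketch of the same idea; the file path and CSV columns are illustrative choices, while the gauge name and status field come from the metrics and CostProfile output shown in this section:

```bash
# One row per minute: the agent's combined-power gauge and InferCost's computed $/hr.
OUT=/tmp/power-cost-telemetry.csv
echo "timestamp,combined_watts,hourly_cost_usd" > "$OUT"
while true; do
  WATTS=$(curl -s http://localhost:9091/metrics \
          | awk '/^apple_power_combined_watts/ {print $NF}')
  COST=$(kubectl get costprofile apple-m5-max \
         -o jsonpath='{.status.hourlyCostUSD}')
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ),${WATTS},${COST}" >> "$OUT"
  sleep 60
done
```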
Per-window aggregates, captured live during the runs:
| Window | Duration | Mean combined W | Mean InferCost $/hr | Agent ↔ InferCost Δ (mean) |
|---|---|---|---|---|
| Qwen3.6-35B-A3B Q8 (full Aider) | 200 min | 27.3 W | $0.1775 | 1.60 W |
| Devstral 2, Aider diff | 32 min | 32.7 W | $0.1773 | 6.21 W |
| Devstral 2, Aider whole | 29 min | 35.3 W | $0.1774 | 8.08 W |
| Devstral 2, HumanEval+ | 55 min | 29.0 W | $0.1770 | 0.90 W |
The "Agent ↔ InferCost Δ" column is the validation result. The agent reads powermetrics every second; InferCost samples the gauge during its 30-second reconcile loop. If they were deeply wrong about each other we'd see double-digit deltas. We don't. Across the four windows, mean delta ranged from 0.9 W to 8 W (the 8 W was during Aider whole format, which has bursty prefill that the 30-second reconcile sometimes catches mid-spike). For sustained workloads the agreement is sub-watt.
Here is what kubectl get costprofile apple-m5-max -o yaml looked like during the run:
```yaml
status:
  currentPowerDrawWatts: 39.13
  hourlyCostUSD: 0.1805
  amortizationRatePerHour: 0.17466
  electricityCostPerHour: 0.00341
  conditions:
  - type: MetalReachable
    status: "True"
    reason: MetalHealthy
    message: "Metal agent healthy at http://localhost:9091/metrics
      (39.1W combined; gpu=37.3 cpu=1.8 ane=0.0)."
  - type: Ready
    status: "True"
    reason: CostComputed
    message: "Hourly cost: $0.1805 (amort: $0.1747, elec: $0.0059)"
```
Not a screenshot. Not a slide. The actual reconcile output from a Kubernetes operator scraping a sudo'd powermetrics subprocess on the same Mac that was running the benchmark.
8. The cost economics
The $4,500 laptop, amortized over 3 years, with maintenance at 2% of the purchase price flat:
- Amortization per hour: $4,500 × 1.02 / 3 / 8760 = $0.17466/hr
- Electricity at 41 W and $0.08/kWh (Peninsula Light residential rate, WA): 0.041 × 0.08 = $0.00328/hr
- Total hourly: $0.178/hr, of which 98.1% is amortization and 1.9% is electricity
That ratio is the most useful thing the bench taught us. The marginal cost of running an LLM on a laptop you already own is essentially the electricity, which on Apple Silicon is genuinely cheap. The amortized cost is the laptop existing at all, which you pay whether or not the model runs.
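If you want to plug in your own purchase price and electricity rate, the split re-derives in one awk call (inputs mirror the figures above):

```bash
# Hourly cost split for an owned machine; swap in your own purchase price, draw, and rate.
awk 'BEGIN {
  amort = 4500 * 1.02 / 3 / 8760;   # purchase price x 1.02 maintenance, over 3 years of hours
  elec  = 0.041 * 0.08;             # 41 W draw at $0.08/kWh
  total = amort + elec;
  printf "amort $%.5f/hr  elec $%.5f/hr  total $%.3f/hr  (amort share %.1f%%)\n",
         amort, elec, total, 100 * amort / total;
}'
```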
Two $/MTok numbers from the windows where the token poller was working correctly:
| Window | Total tokens | $/MTok |
|---|---|---|
| Devstral 2, Aider whole (sustained edits) | 158,614 | $0.30 |
| Devstral 2, HumanEval+ (sequential function calls) | 90,916 | $1.76 |
Aider's whole-file edits keep the GPU producing tokens for longer continuous bursts, which spreads the fixed amortization across more output. HumanEval+ runs many short function-level problems with eval-script setup time between them, which inflates the per-token cost because the laptop is "active" but not generating.
Stacked against Anthropic's published 2026 pricing of $3/MTok input + $15/MTok output for Claude Sonnet 4.6, blended to around $6 to $9 per million total tokens depending on input:output ratio:
- Local Devstral 2 sustained at $0.30/MTok: about 30× cheaper at the margin than cloud Sonnet 4.6.
- Local Devstral 2 with idle gaps at $1.76/MTok: about 5× cheaper at the margin.
Both ratios assume the laptop is running 24/7 for the 3-year amortization horizon. If you actually use the laptop 8 hours a day, the effective amortization per active hour is 3× higher, which compresses the ratio. If you use it 2 hours a day, it's 12× higher and the ratio mostly collapses. The InferCost UsageReport CRD is built specifically to compute the active vs idle split over a billing period, which is the FinOps question that nobody else is answering for Apple Silicon.
9. What we shipped today, and how to use it
Two releases shipped alongside this post, both of which were necessary to do the cost story above end to end:
LLMKube v0.7.2: Apple Silicon power gauges via powermetrics + one-command sudoers install
- Adds 4 new Prometheus gauges (`combined` / `gpu` / `cpu` / `ane` watts) to the existing Metal Agent (PR #334)
- Sourced from a sudo'd `powermetrics --samplers cpu_power,gpu_power -i 1000` subprocess
- Opt-in via the `--apple-power-enabled` flag (defaults off)
- NOPASSWD sudoers fragment with pinned argv for safe install (security audit caught and fixed three findings before merge: argv pinning, `--powermetrics-bin` override rejection, absolute `/usr/bin/sudo` to defeat $PATH substitution attacks)
- New `make install-powermetrics-sudo` and `make uninstall-powermetrics-sudo` targets (PR #336) so the privileged install is one command instead of a 5-line `sed` + `visudo` + `install` shell incantation
- Coverage gap closed: extracted helper at 100% test coverage
- Zero impact on existing setups; without the flag, behavior is unchanged
InferCost v0.3.0: Apple Silicon (Metal) power collector
- Adds `internal/scraper/metal.go` mirroring the existing DCGM scraper (PR #47)
- New `MetalReachable` condition with reasons `MetalHealthy` / `MetalNotConfigured` / `MetalScrapeError` / `MetalSamplerOff` so operators on a Mac don't see "DCGM unreachable" messages
- 10-line dispatcher in the CostProfile reconciler keys off `MetalEndpoint` set + `looksApple(gpuModel)`
- New `apple-m5-max.yaml` sample CostProfile and updated `apple-m2-ultra.yaml` with real setup steps
- 8 controller tests + 5 scraper tests; existing DCGM tests untouched
If you have a MacBook Pro M5 (or M3/M4 Max with enough memory), the full install is now five short steps:
```bash
# 1. Install llama.cpp (needed by the Metal Agent for serving GGUF weights)
brew install llama.cpp

# 2. Install LLMKube via Helm
helm repo add llmkube https://defilantech.github.io/llmkube
helm install llmkube llmkube/llmkube --version 0.7.2

# 3. Build + install the Metal Agent and grant powermetrics access
git clone https://github.com/defilantech/llmkube && cd llmkube
make install-metal-agent          # builds + installs the launchd service
make install-powermetrics-sudo    # one-command pinned-argv NOPASSWD sudoers install

# 4. Restart the agent with --apple-power-enabled in your launchd plist
#    (edit ~/Library/LaunchAgents/com.llmkube.metal-agent.plist, then reload)
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

# 5. Deploy InferCost pointed at the agent and apply the sample CostProfile
helm repo add infercost https://defilantech.github.io/infercost
helm install infercost infercost/infercost --version 0.3.0 \
  --set metal.endpoint=http://localhost:9090/metrics
kubectl apply -f https://raw.githubusercontent.com/defilantech/infercost/main/config/samples/costprofiles/apple-m5-max.yaml

# Watch the live reconcile
kubectl get costprofile apple-m5-max -o yaml
```
The make install-powermetrics-sudo step is the one privileged moment: sudo prompts you for your password, the make target validates the sudoers syntax with visudo -cf before installing, then echoes the granted command back so you can verify exactly what was authorized. The grant is scoped to /usr/bin/powermetrics --samplers cpu_power\,gpu_power -i [0-9]* and nothing else. To remove it later, make uninstall-powermetrics-sudo.
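To audit the result yourself, the installed fragment is a single pinned-argv line. The file path below is a guess at where the make target drops it (check the Makefile if yours differs); the command pattern is the one quoted above:

```bash
# Inspect the installed grant (the sudoers.d path is an assumption, not confirmed from the repo).
sudo cat /etc/sudoers.d/llmkube-powermetrics
# expected shape:
#   <your-user> ALL=(root) NOPASSWD: /usr/bin/powermetrics --samplers cpu_power\,gpu_power -i [0-9]*
```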
Edit purchasePriceUSD, electricity.ratePerKWh, and nodeSelector in the CostProfile to match your reality.
Both projects are open source and hungry for the kind of feedback that comes from running them on hardware we don't have:
- LLMKube (github.com/defilantech/llmkube). Kubernetes-native LLM serving operator. Runs llama.cpp and vLLM on NVIDIA, with the Metal Agent for Apple Silicon. Stars and `good-first-issue` PRs both very welcome. The Metal Agent in particular benefits enormously from Mac-having developers running it with `--apple-power-enabled`, finding the edge cases we missed, and filing issues.
- InferCost (github.com/defilantech/infercost). Kubernetes-native AI FinOps. Cost attribution per workload, namespace, and model, with both NVIDIA (DCGM) and now Apple Silicon (this PR) power sources. The `UsageReport` CRD is the next thing to push on; if you have a multi-Mac fleet or a mixed NVIDIA+Apple environment, we'd love to hear what reports would help your team.
10. Reproducibility
Every number in this post traces back to a file you can pull or a benchmark you can re-run.
- LLMKube: github.com/defilantech/llmkube, main branch at commit `58a94a7` (PR #334 merged). Issue #335 closed.
- InferCost: github.com/defilantech/infercost, main branch at commit `422a4f0` (PR #47 merged). Issue #46 closed.
- Aider Polyglot harness: github.com/Aider-AI/aider with polyglot-benchmark exercises.
- Aider Polyglot leaderboard: polyglot_leaderboard.yml on aider main.
- evalplus: github.com/evalplus/evalplus, scored via the `ganler/evalplus` container for the macOS `RLIMIT_AS` workaround.
- Run scripts: `aider/run-aider-polyglot.sh` (Qwen) and `aider/run-aider-devstral.sh` (Devstral) on this host, both straightforward Bash that invoke the Aider docker container with the right model id and edit format.
- Power + cost telemetry: `/tmp/infercost-m5max-telemetry.csv` (388 power samples) and `/tmp/infercost-m5max-tokens.csv` (333 llama-server token-counter samples). Window markers (`# QWEN_RUN_END`, `# DEVSTRAL_RUN_START`, etc.) inline in the CSV.
- Sample CostProfile: `config/samples/costprofiles/apple-m5-max.yaml` in the InferCost repo.
If a reproducer hits something different, please open an issue against whichever repo is the closest fit. The Apple Silicon path in particular is brand new, and the cohort of people who could give it a real workout is small but motivated.
What's next
A few things the data points to:
- The InferCost `UsageReport` CRD needs a real multi-day test on a Mac running mixed inference + idle. The active vs idle split is the FinOps lever for local models, and we have one day of data; we want a month.
- Multi-Mac fleet support in InferCost (auto-discovery of LLMKube Metal Agents via label selector) would let teams deploy InferCost once and have it follow agents around. The issue tracking that is open.
- We benched Devstral 2 on Aider and HumanEval+. We did not bench it on its native scaffold (Mistral Vibe / OpenHands / Cline). That comparison is the right one for a daily-driver evaluation and it's the next thing we'll publish.
If you're running local LLM inference on your own hardware and care about either the serving side (LLMKube) or the cost side (InferCost), the easiest way to push these projects forward is to point them at your environment, file the issue you'd want to fix, and let us know what number would actually help your team.
Both projects are Apache 2.0. Stars on LLMKube and InferCost are appreciated and signal the kind of validation that helps prioritize the next round of work.