Originally published at llmkube.com/blog/m5-max-aider-polyglot-and-finops. Cross-posted here for the dev.to audience.
A 24-hour Aider Polyglot run, a follow-up bench that blew up in interesting ways, and a working $/MTok number from a Kubernetes operator that scrapes Apple Silicon power live. Two open-source PRs landed today to make all of this reproducible on any M-series Mac.
This is a coding-model benchmark on locally-served weights, plus a FinOps story. Every benchmark number traces to results files we can show you. Every cost number traces to a CSV captured by InferCost during the run. The point is the methodology and the tooling; the model rankings are along for the ride.
TL;DR
- Qwen3.6-35B-A3B Q8 (Tongyi Lab, Apache 2.0) hit 62.2% on Aider Polyglot (pass_rate_2, n=225/225) running locally on a MacBook Pro M5 Max via LLMKube's Metal Agent. That places it above Claude Sonnet 4 with 32k thinking budget (61.3%), o1-high (61.7%), DeepSeek R1 original (56.9%), and Claude 3.5 Sonnet (51.6%) on the official Aider leaderboard. It also beats every published Qwen-family entry on the Polyglot board.
- Devstral-Small-2-2512 Q8 (Mistral, Apache 2.0) hit 4% on Aider Polyglot with the diff edit format, 8% with the whole format, and 81.7% pass@1 on HumanEval+ (all 164 problems). Same model. A 20× swing. Benchmark numbers don't transfer across harnesses, and you should never quote a score without naming the harness it came from.
- InferCost ran the whole time. The new Apple Silicon collector (shipped in InferCost v0.3.0) reconciled $0.18/hr against the `apple-m5-max` CostProfile, with InferCost's reading agreeing with the LLMKube agent's direct gauge within a 1.6 W mean delta over the Qwen window. First widely-published $/MTok number for an Apple Silicon LLM workload that traces to a real Prometheus scrape.
- Two releases shipped alongside this post make all of it reproducible on your own Mac: LLMKube v0.7.2 (Apple power gauges via powermetrics, security-hardened sudoers, and a one-command `make install-powermetrics-sudo`) and InferCost v0.3.0 (Metal collector, condition reporting, sample CostProfile).
1. The hardware and what's special about it
The bench machine is a MacBook Pro M5 Max, 2026 model:
| Spec | Value |
|---|---|
| GPU | 40-core integrated, Metal 4 |
| CPU | 18-core (6 P-cores, 12 E-cores) |
| Unified memory | 128 GB |
| Memory bandwidth | 614 GB/s |
| OS | macOS 25.4 (Darwin) |
| Price | About $4,500 fully configured |
Source: Apple newsroom.
The 614 GB/s bandwidth is the constraint that decides everything that follows. For a dense 24B model at Q8, you need to read about 25 GB of weights per generated token, so the upper bound is 614 / 25 = 24.56 t/s; we measured 24 t/s, within 2.3% of the wall. For a MoE like Qwen3.6-35B-A3B, only the ~3 GB of active-expert weights (3B parameters at Q8) are read per token, so the wall is ~200 t/s and you actually get to choose how to spend the bandwidth. That's the whole story behind why MoE feels fast on a Mac.
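The ceiling math is worth doing yourself; here's a throwaway version, with the GB-per-token figures above treated as rough assumptions rather than exact model sizes:

```bash
# Rough decode ceilings from memory bandwidth; GB-per-token figures are approximations.
awk 'BEGIN {
  bw = 614;              # GB/s unified memory bandwidth (M5 Max)
  dense_q8 = 25;         # ~25 GB of weights read per token, dense 24B at Q8
  moe_active_q8 = 3;     # ~3 GB of active-expert weights per token, 35B-A3B at Q8
  printf "dense 24B Q8 ceiling : %.1f t/s\n", bw / dense_q8;
  printf "MoE 3B-active ceiling: ~%.0f t/s\n", bw / moe_active_q8;
}'
```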
Stack: LLMKube v0.7.x with the Metal Agent feature branch from PR #334 cherry-picked in (now main), llama-server from llama.cpp Metal, and a kind cluster on the same host for the K8s control plane. InferCost was running locally via go run ./cmd/main.go, pointed at the LLMKube agent's /metrics endpoint via a new --metal-endpoint flag.
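Concretely, the InferCost side of that setup was a single command; the endpoint here matches the second Metal Agent on port 9091 described in §7, and the flag is the one this post introduces:

```bash
# Run InferCost locally, pointed at the LLMKube Metal Agent's Prometheus endpoint.
go run ./cmd/main.go --metal-endpoint http://localhost:9091/metrics
```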
2. Qwen3.6-35B-A3B Q8 on Aider Polyglot
The Qwen3.6 family includes a dense 27B and an MoE variant at 35B total / 3B active per token. We ran the MoE quantized to Q8_0 (~36 GB on disk, fits comfortably in 128 GB unified memory with room for KV cache and the rest of macOS).
Aider Polyglot is a 225-problem benchmark across C++, Go, Java, JavaScript, Python, and Rust, designed to keep top frontier coding LLMs in the 5-50% range. Each model gets two attempts per problem: a single-shot solve, and a second attempt after seeing the failed test output. The headline metric is pass_rate_2, the percentage of problems that passed all tests within those two attempts.
Aider was driven from inside a Docker container (aider-benchmark image) talking to llama-server via OPENAI_API_BASE=http://host.docker.internal:<port>/v1. Edit format was diff (Aider's standard for capable models). Threads = 4. The model id we passed to LiteLLM was openai/Qwen3.6-35B-A3B-Q8_0.gguf, the basename llama-server reports.
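For concreteness, here's roughly what the invocation looked like from inside the benchmark container. Treat it as a sketch: the run name, port, and `--exercises-dir` value are placeholders, and the exact flag set should be checked against the Aider benchmark README; only the model id, edit format, and thread count are the actual values from this run.

```bash
# Sketch of the Aider Polyglot invocation (placeholders: run name, port, exercises dir).
export OPENAI_API_BASE=http://host.docker.internal:8080/v1   # llama-server on the host
export OPENAI_API_KEY=local-dummy                            # llama-server ignores it; the client wants one

./benchmark/docker.sh        # drops you into the aider-benchmark container
./benchmark/benchmark.py qwen36-35b-a3b-q8-diff \
  --model openai/Qwen3.6-35B-A3B-Q8_0.gguf \
  --edit-format diff \
  --threads 4 \
  --exercises-dir polyglot-benchmark
```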
The full run took 49.9 hours of cumulative inference time, spread across about 24 hours of real time thanks to the 4 benchmark threads, plus a follow-up resume cycle to handle a runaway-reasoning failure mode. More on that in §3.
The headline result
pass_rate_2 = 62.2% (140 of 225), pass_rate_1 = 34.7% (78 of 225).
We verified against the official Aider Polyglot leaderboard YAML, pulled today. Here's where that lands among the published baselines:
| pass_rate_2 | format | model |
|---|---|---|
| 88.0% | diff | gpt-5 (high) |
| 84.9% | diff | o3-pro (high) |
| 81.3% | diff | o3 (high) |
| 79.6% | diff | grok-4 (high) |
| 72.0% | diff | Claude Opus 4 (32k thinking) |
| 71.4% | diff | DeepSeek R1 (0528) |
| 64.9% | diff | Claude 3.7 Sonnet (32k thinking) |
| 64.0% | architect | DeepSeek R1 + Claude 3.5 Sonnet |
| 62.2% | diff | Qwen3.6-35B-A3B Q8 (this run, M5 Max, Apache 2.0, ours) |
| 61.7% | diff | o1-2024-12-17 (high) |
| 61.3% | diff | Claude Sonnet 4 (32k thinking) |
| 60.4% | diff | Claude 3.7 Sonnet (no thinking) |
| 59.6% | diff | Qwen3 235B A22B (no think, Alibaba API) |
| 56.9% | diff | DeepSeek R1 (original) |
| 56.4% | diff | Claude Sonnet 4 (no thinking) |
| 51.6% | diff | Claude 3.5 Sonnet |
| 40.0% | diff | Qwen3 32B |
| 8.0% | diff | Qwen2.5-Coder-32B |
The defensible reads:
- Beats Claude Sonnet 4 with 32k thinking budget by 0.9 percentage points.
- Beats o1-high by 0.5 percentage points.
- Beats DeepSeek R1 original by 5.3 percentage points.
- Beats Claude 3.5 Sonnet by 10.6 percentage points.
- Within 2.7 points of Claude 3.7 Sonnet (32k thinking).
- Strongest open-weights Qwen-family number on the Polyglot leaderboard. Qwen3 32B sat at 40.0%, Qwen3 235B A22B at 59.6%. The 35B-A3B MoE architecture is doing real work for its size, even at Q8.
What we are not claiming: that this beats Opus 4, GPT-5, o3, or DeepSeek V3.2-Exp Reasoner. Those all sit above us on the leaderboard. Qwen3.6 is in the same band as Sonnet 4 thinking, not in the band with o3-high or GPT-5.
Per-language
| Language | n | pass_1 | pass_2 | p2 % | avg min/exercise |
|---|---|---|---|---|---|
| python | 34 | 14 | 25 | 73.5% | 4.8 |
| javascript | 49 | 20 | 35 | 71.4% | 5.3 |
| go | 39 | 16 | 24 | 61.5% | 6.2 |
| rust | 30 | 14 | 17 | 56.7% | 9.2 |
| cpp | 26 | 4 | 14 | 53.8% | 21.7 |
| java | 47 | 10 | 25 | 53.2% | 31.9 |
Two things worth noting. First, Python and JavaScript at ~71-74% put the model in thinking-Sonnet territory on the languages most developers actually use Aider for. Second, Java at 31.9 minutes per exercise on average is inflated by the runaway-reasoning case described next. Strip the outlier and Java's average is in line with C++.
3. The runaway-reasoning failure mode (and the resume that closed it out)
About 21 hours into the run, the container got stuck on a Java exercise that consumed 80 minutes of wall time without writing a new result file or producing meaningful output. The log mtime stayed frozen, the container stayed "Up," and the model was clearly deep in a reasoning loop with no exit strategy. We stopped the container manually at n=223/225 and recorded runaway reasoning as a real failure mode of hybrid-thinking MoE models on agentic harnesses.
The next night, we resumed via Aider's official --cont flag against the same run directory. Two missing exercises (rust/forth and javascript/go-counting) ran in parallel under --threads 4 and completed in about 6 minutes each. Both failed both attempts. Final result: n=225/225, pass_rate_2 = 62.2%.
The headline ticked down by 0.6 percentage points compared to the n=223 partial (62.8% → 62.2%) because the two missing exercises both failed. That's the most honest defense against any "stopped early to lock in a favorable number" critique: completing the run actually hurt us.
If you reproduce this and see a similar hang, kill the container and run with --cont later to fill in the gaps. The complete dataset is healthier than a partial one.
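The resume itself is just the same invocation with `--cont` pointed at the existing run name (same sketch caveats as the invocation above):

```bash
# Re-run against the existing run directory; --cont fills in only the missing exercises.
./benchmark/benchmark.py qwen36-35b-a3b-q8-diff \
  --model openai/Qwen3.6-35B-A3B-Q8_0.gguf \
  --edit-format diff --threads 4 --cont
```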
4. The other thing we wanted to test
With Qwen3.6 in hand, the natural next move was a comparison candidate. The ideal contrast: a dense model purpose-built for agentic coding, not a general-purpose coder.
Devstral-Small-2-24B-Instruct-2512 was the obvious pick. Mistral and All Hands AI co-trained it specifically for software-engineering agents, it's an Apache 2.0 dense 24B with a 256K context window, and Mistral published 68.0% SWE-Bench Verified for it (a real number on a real benchmark). Released November 2025, so about 5 months old at time of writing. The architecture is the new "Ministral 3 with rope-scaling and Scalable-Softmax" stack from Mistral, structurally different from Devstral 1.x.
We deployed it via the same LLMKube + Metal Agent path, kicked off Aider Polyglot with --num-tests 25 (random subset, fits a 4-hour window at Devstral's slower decode speed of ~24 t/s), edit format diff.
Result: pass_rate_2 = 4.0% (1 of 25).
We almost wrote it off as broken. Then we read the Aider results files more carefully:
- 92% of responses were syntactically well-formed diffs.
- Zero exhausted context windows.
- Average 4.4 minutes per exercise (fast, not stuck).
- The model was producing valid-looking edit blocks; they were just semantically wrong.
The model wasn't broken. It was doing what it had been trained to do, which apparently wasn't this.
5. Investigation
Three hypotheses, ordered by what we tried:
Hypothesis 1: The diff format is the problem. Aider supports --edit-format whole (output complete files instead of diffs). Re-ran with whole format on the same 25-exercise subset.
Result: pass_rate_2 = 8.0% (2 of 25). Better, but not by much. Hypothesis weakly supported.
Hypothesis 2: llama.cpp isn't handling Devstral 2's new architecture correctly. Worth checking before declaring the model bad. We ran HumanEval+ via evalplus, pointed at the same llama-server endpoint, with a function-level Python coding harness that doesn't require any agentic tool-call discipline. If llama.cpp's tokenizer or attention implementation was off, we'd see it here.
Result: HumanEval pass@1 = 85.4%, HumanEval+ pass@1 = 81.7% (164 problems, scored in ganler/evalplus Linux container because macOS's setrlimit(RLIMIT_AS) doesn't behave the way evalplus's sandbox expects).
That landed Devstral 2 in the same band as the top open-source 24B coders for function-level Python. Architecture is fine. llama.cpp is fine. The model is genuinely capable.
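For reference, the scoring step looked roughly like the evalplus README's container usage; treat the exact arguments as an approximation and the samples file name as a placeholder (the generation step against llama-server isn't shown here):

```bash
# Score pre-generated completions inside the Linux container to sidestep the macOS RLIMIT_AS issue.
docker run --rm -v "$PWD:/app" ganler/evalplus:latest \
  --dataset humaneval --samples /app/samples.jsonl
```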
Hypothesis 3: The harness is the variable. We re-read Mistral's README:
Devstral 2 can also be used with the following scaffoldings:
- Mistral Vibe (recommended)
- Cline
- Kilo Code
- Claude Code
- OpenHands
- SWE Agent
Aider is not on this list. Devstral 2 was trained on tool-call traces from agentic-coding harnesses that use multi-turn function calls, not Aider's single-prompt-with-diff edit format. The model was producing what its training distribution rewarded; Aider's harness was scoring it on a different distribution entirely.
Mistral itself adds, in the same README:
"we advise everyone to use the Mistral AI API if the model is subpar with local serving"
That's an explicit caveat from the model authors. The 4% wasn't a model failure or a runtime failure. It was a harness-distribution mismatch, exactly the failure mode the README warned about.
6. Same model, three benchmarks, three answers
| Benchmark | Devstral 2 score |
|---|---|
| Aider Polyglot, diff format | 4.0% |
| Aider Polyglot, whole format | 8.0% |
| HumanEval+ (with adversarial tests) | 81.7% |
| HumanEval (base) | 85.4% |
A twenty-fold difference in measured "performance" on the same model, same hardware, same temperature, same week. This is the lesson worth taking away from the entire bench session.
If you publish a single benchmark number for any agentic coding model, you are publishing a story about that model's compatibility with one specific harness, not a story about the model's coding capability. The Devstral 2 4% on Aider does not mean Devstral 2 is bad at coding. The Devstral 2 81.7% on HumanEval+ does not mean Devstral 2 is good at agentic edits in your IDE. They are both true and they describe different things.
If you want to evaluate a coding model, run it through the harness you actually use day to day. If you can't, then quote at least two benchmarks from different parts of the harness landscape (one function-level, one agentic) and let the reader see the spread.
7. InferCost was running the whole time
While the benchmarks were producing accuracy numbers, InferCost was producing the cost numbers. The new Apple Silicon collector (shipped in InferCost v0.3.0) was reconciling the apple-m5-max CostProfile every 30 seconds against the LLMKube Metal Agent's apple_power_combined_watts gauge.
Specifically, two things were running in the background of every benchmark above:
- A second LLMKube Metal Agent on port 9091 with `--apple-power-enabled`, publishing the four new `apple_power_*_watts` Prometheus gauges sourced from a sudo'd `powermetrics` subprocess. Pinned-argv NOPASSWD sudoers entry to keep the privilege grant tight (security audit caught and fixed three findings before merge: argv pinning, bin override rejection, absolute `/usr/bin/sudo` to defeat $PATH attacks).
- InferCost as a local controller, pointed at `:9091/metrics` via the new `--metal-endpoint` CLI flag, reconciling an `apple-m5-max` CostProfile using the new Metal scraper and dispatcher.

Plus a tiny CSV poller that sampled both layers every 60 seconds, writing 388 rows of telemetry across the day.
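The poller itself is nothing clever. Here's a minimal sketch of the same idea; the file path and CSV columns are illustrative choices, while the gauge name and status field come from the metrics and CostProfile output shown in this section:

```bash
# One row per minute: the agent's combined-power gauge and InferCost's computed $/hr.
OUT=/tmp/power-cost-telemetry.csv
echo "timestamp,combined_watts,hourly_cost_usd" > "$OUT"
while true; do
  WATTS=$(curl -s http://localhost:9091/metrics \
          | awk '/^apple_power_combined_watts/ {print $NF}')
  COST=$(kubectl get costprofile apple-m5-max \
         -o jsonpath='{.status.hourlyCostUSD}')
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ),${WATTS},${COST}" >> "$OUT"
  sleep 60
done
```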
Per-window aggregates, captured live during the runs:
| Window | Duration | Mean combined W | Mean InferCost $/hr | Agent ↔ InferCost Δ (mean) |
|---|---|---|---|---|
| Qwen3.6-35B-A3B Q8 (full Aider) | 200 min | 27.3 W | $0.1775 | 1.60 W |
| Devstral 2, Aider diff | 32 min | 32.7 W | $0.1773 | 6.21 W |
| Devstral 2, Aider whole | 29 min | 35.3 W | $0.1774 | 8.08 W |
| Devstral 2, HumanEval+ | 55 min | 29.0 W | $0.1770 | 0.90 W |
The "Agent ↔ InferCost Δ" column is the validation result. The agent reads powermetrics every second; InferCost samples the gauge during its 30-second reconcile loop. If they were deeply wrong about each other we'd see double-digit deltas. We don't. Across the four windows, mean delta ranged from 0.9 W to 8 W (the 8 W was during Aider whole format, which has bursty prefill that the 30-second reconcile sometimes catches mid-spike). For sustained workloads the agreement is sub-watt.
Here is what kubectl get costprofile apple-m5-max -o yaml looked like during the run:
```yaml
status:
  currentPowerDrawWatts: 39.13
  hourlyCostUSD: 0.1805
  amortizationRatePerHour: 0.17466
  electricityCostPerHour: 0.00341
  conditions:
  - type: MetalReachable
    status: "True"
    reason: MetalHealthy
    message: "Metal agent healthy at http://localhost:9091/metrics
      (39.1W combined; gpu=37.3 cpu=1.8 ane=0.0)."
  - type: Ready
    status: "True"
    reason: CostComputed
    message: "Hourly cost: $0.1805 (amort: $0.1747, elec: $0.0059)"
```
Not a screenshot. Not a slide. The actual reconcile output from a Kubernetes operator scraping a sudo'd powermetrics subprocess on the same Mac that was running the benchmark.
8. The cost economics
The $4,500 laptop, amortized over 3 years, with maintenance at 2% of the purchase price flat:
- Amortization per hour: $4,500 × 1.02 / 3 / 8760 = $0.17466/hr
- Electricity at 41 W and $0.08/kWh (Peninsula Light residential rate, WA): 0.041 × 0.08 = $0.00328/hr
- Total hourly: $0.178/hr, of which 98.1% is amortization and 1.9% is electricity
That ratio is the most useful thing the bench taught us. The marginal cost of running an LLM on a laptop you already own is essentially the electricity, which on Apple Silicon is genuinely cheap. The amortized cost is the laptop existing at all, which you pay whether or not the model runs.
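If you want to plug in your own purchase price and electricity rate, the split re-derives in one awk call (inputs mirror the figures above):

```bash
# Hourly cost split for an owned machine; swap in your own purchase price, draw, and rate.
awk 'BEGIN {
  amort = 4500 * 1.02 / 3 / 8760;   # purchase price x 1.02 maintenance, over 3 years of hours
  elec  = 0.041 * 0.08;             # 41 W draw at $0.08/kWh
  total = amort + elec;
  printf "amort $%.5f/hr  elec $%.5f/hr  total $%.3f/hr  (amort share %.1f%%)\n",
         amort, elec, total, 100 * amort / total;
}'
```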
Two $/MTok numbers from the windows where the token poller was working correctly:
| Window | Total tokens | $/MTok |
|---|---|---|
| Devstral 2, Aider whole (sustained edits) | 158,614 | $0.30 |
| Devstral 2, HumanEval+ (sequential function calls) | 90,916 | $1.76 |
Aider's whole-file edits keep the GPU producing tokens for longer continuous bursts, which spreads the fixed amortization across more output. HumanEval+ runs many short function-level problems with eval-script setup time between them, which inflates the per-token cost because the laptop is "active" but not generating.
Stacked against Anthropic's published 2026 pricing of $3/MTok input + $15/MTok output for Claude Sonnet 4.6, blended to around $6 to $9 per million total tokens depending on input:output ratio:
- Local Devstral 2 sustained at $0.30/MTok: about 30× cheaper at the margin than cloud Sonnet 4.6.
- Local Devstral 2 with idle gaps at $1.76/MTok: about 5× cheaper at the margin.
Both ratios assume the laptop is running 24/7 for the 3-year amortization horizon. If you actually use the laptop 8 hours a day, the effective amortization per active hour is 3× higher, which compresses the ratio. If you use it 2 hours a day, it's 12× higher and the ratio mostly collapses. The InferCost UsageReport CRD is built specifically to compute the active vs idle split over a billing period, which is the FinOps question that nobody else is answering for Apple Silicon.
9. What we shipped today, and how to use it
Two releases shipped alongside this post, both of which were necessary to do the cost story above end to end:
LLMKube v0.7.2: Apple Silicon power gauges via powermetrics + one-command sudoers install
- Adds 4 new Prometheus gauges (`combined` / `gpu` / `cpu` / `ane` watts) to the existing Metal Agent (PR #334)
- Sourced from a sudo'd `powermetrics --samplers cpu_power,gpu_power -i 1000` subprocess
- Opt-in via the `--apple-power-enabled` flag (defaults off)
- NOPASSWD sudoers fragment with pinned argv for safe install (security audit caught and fixed three findings before merge: argv pinning, `--powermetrics-bin` override rejection, absolute `/usr/bin/sudo` to defeat $PATH substitution attacks)
- New `make install-powermetrics-sudo` and `make uninstall-powermetrics-sudo` targets (PR #336) so the privileged install is one command instead of a 5-line `sed` + `visudo` + `install` shell incantation
- Coverage gap closed: extracted helper at 100% test coverage
- Zero impact on existing setups; without the flag, behavior is unchanged
InferCost v0.3.0: Apple Silicon (Metal) power collector
- Adds `internal/scraper/metal.go` mirroring the existing DCGM scraper (PR #47)
- New `MetalReachable` condition with reasons `MetalHealthy` / `MetalNotConfigured` / `MetalScrapeError` / `MetalSamplerOff` so operators on a Mac don't see "DCGM unreachable" messages
- 10-line dispatcher in the CostProfile reconciler keys off `MetalEndpoint` set + `looksApple(gpuModel)`
- New `apple-m5-max.yaml` sample CostProfile and updated `apple-m2-ultra.yaml` with real setup steps
- 8 controller tests + 5 scraper tests; existing DCGM tests untouched
If you have a MacBook Pro M5 (or M3/M4 Max with enough memory), the full install is now five short steps:
```bash
# 1. Install llama.cpp (needed by the Metal Agent for serving GGUF weights)
brew install llama.cpp

# 2. Install LLMKube via Helm
helm repo add llmkube https://defilantech.github.io/llmkube
helm install llmkube llmkube/llmkube --version 0.7.2

# 3. Build + install the Metal Agent and grant powermetrics access
git clone https://github.com/defilantech/llmkube && cd llmkube
make install-metal-agent          # builds + installs the launchd service
make install-powermetrics-sudo    # one-command pinned-argv NOPASSWD sudoers install

# 4. Restart the agent with --apple-power-enabled in your launchd plist
#    (edit ~/Library/LaunchAgents/com.llmkube.metal-agent.plist, then reload)
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

# 5. Deploy InferCost pointed at the agent and apply the sample CostProfile
helm repo add infercost https://defilantech.github.io/infercost
helm install infercost infercost/infercost --version 0.3.0 \
  --set metal.endpoint=http://localhost:9090/metrics
kubectl apply -f https://raw.githubusercontent.com/defilantech/infercost/main/config/samples/costprofiles/apple-m5-max.yaml

# Watch the live reconcile
kubectl get costprofile apple-m5-max -o yaml
```
The make install-powermetrics-sudo step is the one privileged moment: sudo prompts you for your password, the make target validates the sudoers syntax with visudo -cf before installing, then echoes the granted command back so you can verify exactly what was authorized. The grant is scoped to /usr/bin/powermetrics --samplers cpu_power\,gpu_power -i [0-9]* and nothing else. To remove it later, make uninstall-powermetrics-sudo.
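To audit the result yourself, the installed fragment is a single pinned-argv line. The file path below is a guess at where the make target drops it (check the Makefile if yours differs); the command pattern is the one quoted above:

```bash
# Inspect the installed grant (the sudoers.d path is an assumption, not confirmed from the repo).
sudo cat /etc/sudoers.d/llmkube-powermetrics
# expected shape:
#   <your-user> ALL=(root) NOPASSWD: /usr/bin/powermetrics --samplers cpu_power\,gpu_power -i [0-9]*
```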
Edit purchasePriceUSD, electricity.ratePerKWh, and nodeSelector in the CostProfile to match your reality.
Both projects are open source and hungry for the kind of feedback that comes from running them on hardware we don't have:
- LLMKube (github.com/defilantech/llmkube). Kubernetes-native LLM serving operator. Runs llama.cpp and vLLM on NVIDIA, with the Metal Agent for Apple Silicon. Stars and `good-first-issue` PRs both very welcome. The Metal Agent in particular benefits enormously from Mac-having developers running it with `--apple-power-enabled`, finding the edge cases we missed, and filing issues.
- InferCost (github.com/defilantech/infercost). Kubernetes-native AI FinOps. Cost attribution per workload, namespace, and model, with both NVIDIA (DCGM) and now Apple Silicon (this PR) power sources. The `UsageReport` CRD is the next thing to push on; if you have a multi-Mac fleet or a mixed NVIDIA+Apple environment, we'd love to hear what reports would help your team.
10. Reproducibility
Every number in this post traces back to a file you can pull or a benchmark you can re-run.
- LLMKube: github.com/defilantech/llmkube, main branch at commit `58a94a7` (PR #334 merged). Issue #335 closed.
- InferCost: github.com/defilantech/infercost, main branch at commit `422a4f0` (PR #47 merged). Issue #46 closed.
- Aider Polyglot harness: github.com/Aider-AI/aider with polyglot-benchmark exercises.
- Aider Polyglot leaderboard: polyglot_leaderboard.yml on aider main.
- evalplus: github.com/evalplus/evalplus, scored via the `ganler/evalplus` container for the macOS `RLIMIT_AS` workaround.
- Run scripts: `aider/run-aider-polyglot.sh` (Qwen) and `aider/run-aider-devstral.sh` (Devstral) on this host, both straightforward Bash that invoke the Aider docker container with the right model id and edit format.
- Power + cost telemetry: `/tmp/infercost-m5max-telemetry.csv` (388 power samples) and `/tmp/infercost-m5max-tokens.csv` (333 llama-server token-counter samples). Window markers (`# QWEN_RUN_END`, `# DEVSTRAL_RUN_START`, etc.) inline in the CSV.
- Sample CostProfile: `config/samples/costprofiles/apple-m5-max.yaml` in the InferCost repo.
If a reproducer hits something different, please open an issue against whichever repo is the closest fit. The Apple Silicon path in particular is brand new, and the cohort of people who could give it a real workout is small but motivated.
What's next
A few things the data points to:
- The InferCost `UsageReport` CRD needs a real multi-day test on a Mac running mixed inference + idle. The active vs idle split is the FinOps lever for local models, and we have one day of data; we want a month.
- Multi-Mac fleet support in InferCost (auto-discovery of LLMKube Metal Agents via label selector) would let teams deploy InferCost once and have it follow agents around. The issue tracking that is open.
- We benched Devstral 2 on Aider and HumanEval+. We did not bench it on its native scaffold (Mistral Vibe / OpenHands / Cline). That comparison is the right one for a daily-driver evaluation and it's the next thing we'll publish.
If you're running local LLM inference on your own hardware and care about either the serving side (LLMKube) or the cost side (InferCost), the easiest way to push these projects forward is to point them at your environment, file the issue you'd want to fix, and let us know what number would actually help your team.
Both projects are Apache 2.0. Stars on LLMKube and InferCost are appreciated and signal the kind of validation that helps prioritize the next round of work.