JT

Your Mac Is a Supercomputer. It's Time We Benchmarked It Like One.

Why open-source local AI benchmarking on Apple Silicon matters, and why your benchmark submission is more valuable than you think.


The narrative around AI has been almost entirely cloud-centric. You send a prompt to a data center somewhere, tokens come back, and you pretend not to think about what that costs in latency, money, or privacy. For a long time, that was the only game in town.

That's changing fast.

Apple Silicon - from the M1 to the M4 Pro/Max shipping in machines today, with M5 Max on the horizon - has quietly become one of the most capable local AI compute platforms on the planet. The unified memory architecture means an M4 Max with 128GB of RAM can run models that would require a dedicated GPU workstation in any other form factor. At laptop wattages. Silently. Offline. Without sending a single token to a third party.

This isn't a niche enthusiast story anymore. It's a real shift in how developers, researchers, and privacy-conscious professionals are choosing to run AI workloads. And it comes with a problem we haven't solved yet: we don't have great, shared, community-driven data on how these machines actually perform in the wild.

That's what I built Anubis OSS to help fix.


The Fragmented Local LLM Ecosystem

If you've spent time running local models on macOS, you've felt this friction. The tooling is scattered and siloed:

  • Chat wrappers like Ollama, LM Studio, and Jan are excellent at what they do (conversation), but they're not built for systematic performance testing.
  • Hardware monitors like asitop, macmon, and mactop give you a beautiful CLI view of GPU and CPU utilization, but they have no concept of what the LLM is doing, which model is loaded, or what the prompt context size is.
  • Eval frameworks like promptfoo require YAML configs and a level of terminal fluency that puts them out of reach for many practitioners.

None of these tools correlate hardware behavior with inference performance in a meaningful, accessible way. You can watch your GPU spike during a generation pass, but you can't easily answer: Is Gemma 3 12B Q4_K_M more watt-efficient than Mistral Small 3.1 on an M3 Pro? How does TTFT scale with context length on an M4 with 32GB vs. 64GB? Which quantization gives the best tokens-per-watt on the Neural Engine?
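The questions above all reduce to a few derived metrics over a run's raw measurements. As a rough illustration of the arithmetic (the field and function names here are hypothetical, not Anubis's actual schema):

```python
from dataclasses import dataclass

@dataclass
class RunSample:
    """Raw measurements from one benchmark run (illustrative fields)."""
    prompt_tokens: int
    generated_tokens: int
    first_token_s: float   # seconds from request start to first token (TTFT)
    total_s: float         # seconds from request start to last token
    avg_power_w: float     # mean package power over the run, in watts

def tokens_per_second(run: RunSample) -> float:
    # Throughput over the generation phase only, excluding prompt processing.
    return run.generated_tokens / (run.total_s - run.first_token_s)

def tokens_per_watt(run: RunSample) -> float:
    # Efficiency: generation throughput normalized by average power draw.
    return tokens_per_second(run) / run.avg_power_w
```

Comparing `tokens_per_watt` for the same model at different quantizations, or for the same quantization on different chips, is exactly the kind of cross-cut the fragmented tooling makes hard today.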

Anubis answers those questions. It's a native SwiftUI app — no Electron, no Python runtime, no external dependencies — that runs benchmark sessions against any OpenAI-compatible backend (Ollama, LM Studio, mlx-lm, vLLM, and more) while simultaneously pulling real hardware telemetry via IOReport: GPU utilization, CPU utilization, GPU/CPU/ANE/DRAM power in watts, GPU frequency, process memory including Metal allocations, and thermal state.
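For a sense of how a metric like TTFT falls out of an OpenAI-compatible backend, here is a minimal sketch assuming the standard streaming response shape (SSE `data:` lines carrying `delta` chunks, terminated by `data: [DONE]`) that Ollama and LM Studio expose on their OpenAI-compatible endpoints. This is illustrative code, not Anubis's implementation:

```python
import json
import time
from typing import Iterable, Optional

def ttft_from_sse(lines: Iterable[str], t_start: float,
                  clock=time.monotonic) -> Optional[float]:
    """Seconds from t_start to the first content-bearing chunk, or None.

    `lines` is the decoded SSE stream from a chat completion request with
    stream=True; `clock` is injectable so the logic is testable offline.
    """
    for line in lines:
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            # First chunk with actual text: this is the user-perceived latency.
            return clock() - t_start
    return None
```

Anubis does this natively in Swift while sampling IOReport in parallel, which is what lets it correlate the timing with power and utilization.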

Every run is logged, exportable as CSV or Markdown, and — optionally — submittable to the community leaderboard.


Why the Open Dataset Is the Real Story

The leaderboard submissions aren't just a scoreboard. They're the beginning of something more interesting: a real-world, community-sourced performance dataset across diverse Apple Silicon configurations, model families, quantizations, and backends.

Think about what that dataset contains:

  • Tokens per second across M1, M2, M3, M4, M5+ chips with varying unified memory configurations
  • Time to first token (TTFT) as a function of prompt length and model size
  • Watts-per-token efficiency across quantization levels (Q4, Q5, Q8, fp16) for the same model family
  • Backend-specific variance — the same model, same chip, different runner (Ollama vs. mlx-lm vs. LM Studio)
  • Thermal throttling behavior under sustained inference loads

This data is hard to get any other way. Formal benchmarks from chipmakers are synthetic. Reviewer benchmarks cover a handful of models on a handful of chips. Nobody has the time or hardware budget to run a comprehensive cross-product matrix.

But collectively, the community does.

How Model Runner Developers Can Use This

If you're building or maintaining a backend like Ollama, LM Studio, or mlx-lm, community benchmark data tells you things your internal testing can't:

  • Which chip/memory configurations are underperforming relative to their theoretical bandwidth? If an M3 Pro with 36GB is consistently underperforming an M2 Max with comparable unified memory on a specific model class, that's a signal worth investigating in your memory management or Metal compute path.
  • Where is TTFT worst? Time to first token is often the user-perceived latency that matters most. If community data shows TTFT degrading sharply on longer contexts for certain quantizations, that's a tuning opportunity.
  • What's the real-world power envelope? Synthetic benchmarks don't capture sustained thermal behavior. Community submissions do.

How Model Tuners and Quantization Authors Can Use This

The dataset is equally valuable if you're working on GGUF quantizations, MLX conversions, or fine-tuned adapters:

  • Quantization efficiency curves across real hardware tell you where the quality/performance tradeoff actually lands for end users, not just on a benchmark server.
  • ANE utilization patterns — which quantization levels or architectures make better use of the Neural Engine — are nearly invisible without this kind of community telemetry at scale.
  • Memory footprint data (including Metal/GPU allocations tracked via proc_pid_rusage) shows whether your quantization is actually reducing real-world memory pressure or just parameter count.

If you're shipping a new quantization of a popular model and want to understand how it performs across the installed base of Apple Silicon hardware your users actually own, this dataset is the closest thing to field telemetry you'll have access to.


The Apple Silicon Trajectory Makes This Urgent

M5 is coming. The M4 Ultra hasn't even shipped in the Mac Pro yet. The memory ceiling on Apple Silicon keeps rising — 128GB on the M4 Max, and up to 512GB on the M3 Ultra Mac Studio, means models that were cloud-only a year ago can run locally today.

Each generation, the gap between "what you can run locally" and "what you need the cloud for" narrows. We're already past the inflection point for most 7B–13B models. We're approaching it for 30B–70B classes on high-end configs. The 100B+ frontier is a matter of time and memory density.
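As a back-of-envelope sanity check on those claims: resident weight size scales with parameter count times bits per parameter, plus runtime overhead. The bit widths and overhead factor below are rough, commonly cited figures, not exact numbers for any specific runner:

```python
def approx_model_gb(params_b: float, bits_per_param: float,
                    overhead: float = 1.2) -> float:
    """Approximate resident size of a model in GB (decimal).

    params_b: parameter count in billions
    bits_per_param: ~4.5 for Q4_K_M, ~8.5 for Q8_0, 16 for fp16 (rough)
    overhead: fudge factor for KV cache, activations, runtime buffers
    """
    weight_bytes = params_b * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

# A 70B model at ~4.5 bits/param lands in the neighborhood of 45-50 GB,
# comfortably inside a 128GB unified-memory machine; at fp16 it would not fit.
```

The same arithmetic shows why the 100B+ class is a memory-density question rather than a compute question.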

The decisions that model runners and quantization authors make right now (about memory management, Metal optimization, and ANE scheduling) will determine how well that new hardware gets utilized by the community. Having a community dataset that shows real-world performance gaps is one of the best early-warning systems we have.


Why You Should Participate

Running a benchmark in Anubis takes about two minutes. Submitting it to the leaderboard takes one click. But here's why it's worth your time beyond the leaderboard:

Your hardware configuration is probably underrepresented. The M4 Pro with 48GB, the M2 Max with 96GB, the M3 Ultra — the matrix of chip × memory × thermal environment × backend is enormous. Every submission fills in a cell of that matrix that nobody else may have covered.

Your workload patterns matter. If you're running long-context inference, or using a backend that others aren't, or benchmarking a model that just dropped — that data is genuinely novel and useful to the ecosystem.

The dataset is open. This isn't data that disappears into a corporate analytics pipeline. It's a community resource, available for anyone building tools, writing research, or trying to optimize for the platform.

The project needs the stars. Anubis OSS is working toward 75 GitHub stars to qualify for Homebrew Cask distribution, which would make it dramatically easier for people to install and run. If you find value in what the project is doing, a star is a genuinely meaningful contribution to its reach.


Get Started

  1. Download Anubis OSS from the latest GitHub release — it's a notarized macOS app, no build required
  2. Run a benchmark — load any model in your preferred backend, pick a prompt preset, hit run
  3. Submit your results to the community leaderboard
  4. Star the repo at github.com/uncSoft/anubis-oss to help us hit Homebrew distribution

The local AI era is here. Let's build the shared infrastructure to understand it.


Anubis OSS is GPL-3.0 licensed. Built in Swift, no external dependencies, privacy-first — your benchmark data is submitted voluntarily and never includes anything beyond hardware specs and model performance metrics. A limited version is also available as part of The Architect's Toolkit bundle on the Mac App Store.

Questions or contributions? Open an issue or PR on GitHub.
