A new project just hit Hacker News at 194+ points: Hypura — a storage-tier-aware LLM inference scheduler specifically for Apple Silicon.
This is significant because it addresses the biggest limitation of running LLMs locally on a Mac: memory capacity.
The Problem
How different model sizes fare on Apple Silicon hardware:
| Model | Approx. RAM needed (quantized) | M3 Max (96GB) | M4 Ultra (192GB) |
|---|---|---|---|
| Llama 3 8B | 8GB | ✅ Fast | ✅ Fast |
| Llama 3 70B | 40GB | ⚠️ Slow (swap) | ✅ Fast |
| Mixtral 8x22B | 88GB | ❌ Won't fit | ⚠️ Tight |
| Llama 3.1 405B | 200GB+ | ❌ Won't fit | ❌ Won't fit |
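The arithmetic behind the table is simple: weights take (parameters × bits per weight) / 8 bytes, plus room for the KV cache and runtime buffers. A minimal estimator (the ~20% overhead factor is my assumption, not a figure from the post):

```python
def model_ram_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Estimate resident RAM for an LLM.

    params_billions: parameter count in billions (e.g. 70 for Llama 3 70B)
    bits_per_weight: 16 for fp16, 4 for 4-bit quantization
    overhead: rough multiplier for KV cache / runtime buffers (assumption)
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Llama 3 70B at 4-bit lands around 42 GB -- which is why it swaps on a
# 96GB machine once the OS, your apps, and the KV cache are counted in.
print(round(model_ram_gb(70, 4)))   # 42
print(round(model_ram_gb(405, 4)))  # 243
```

Run the numbers for your own hardware before downloading a 40GB checkpoint.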
Apple's unified memory is great, but when models exceed available RAM, inference falls off a cliff.
What Hypura Does
Hypura is a scheduler that's aware of Apple Silicon's storage tiers — it intelligently manages which model layers live in:
- Unified memory (fastest)
- Compressed memory (middle ground)
- SSD swap (slowest, but huge)
This means you can run larger models than your RAM should allow, with better performance than naive swap.
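The post doesn't show Hypura's API, but the core idea — assign the hottest layers to the fastest tier that still has room — can be sketched as a greedy placement pass. Everything below (tier names, capacities, layer sizes) is illustrative, not Hypura's actual algorithm:

```python
from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    capacity_gb: float
    used_gb: float = 0.0
    layers: list = field(default_factory=list)

def place_layers(layer_sizes_gb: list[float], tiers: list[Tier]) -> None:
    """Greedy tier-aware placement: each layer, in order, goes into the
    fastest tier with free capacity. Sketch only -- a real scheduler would
    also weigh access frequency and migrate layers at runtime."""
    for i, size in enumerate(layer_sizes_gb):
        for tier in tiers:  # tiers are ordered fastest -> slowest
            if tier.used_gb + size <= tier.capacity_gb:
                tier.used_gb += size
                tier.layers.append(i)
                break
        else:
            raise MemoryError(f"layer {i} ({size} GB) fits in no tier")

tiers = [
    Tier("unified_memory", capacity_gb=64),  # fastest
    Tier("compressed", capacity_gb=16),      # middle ground
    Tier("ssd_swap", capacity_gb=512),       # slower but huge
]
place_layers([0.5] * 200, tiers)  # a 100 GB model in 0.5 GB layers
print([(t.name, len(t.layers)) for t in tiers])
# [('unified_memory', 128), ('compressed', 32), ('ssd_swap', 40)]
```

The win over naive OS swap is that the scheduler knows layer access order, so the slow tier holds the layers that hurt least.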
Why This Matters for Developers
1. Local LLM development gets more practical
If you're building AI-powered tools, local inference means:
- No API costs
- No rate limits
- No data leaving your machine
- Works offline
2. Apple Silicon is becoming an AI platform
With Arm shipping increasingly AI-focused CPU designs and Apple steadily building out its ML tooling:
# The Apple Silicon ML stack in 2026:
# - MLX (Apple's ML framework)
# - Core ML (on-device inference)
# - Hypura (intelligent scheduling)
# - Metal (GPU compute)
# - Neural Engine (dedicated AI silicon)
This is no longer a "NVIDIA or nothing" world for AI development.
3. Cost comparison
| Option | Monthly Cost | Latency |
|---|---|---|
| GPT-4o API (1M tokens/day) | ~$150 | 50-200ms |
| Claude Sonnet API (1M tokens/day) | ~$90 | 50-150ms |
| Mac Studio M4 Ultra (one-time) | $0/mo (after $4K purchase) | 30-100ms |
| Cloud GPU (A100) | ~$900/mo | 20-50ms |
If you're generating 1M+ tokens/day, the math works out: a $4K Mac Studio beats an A100 cloud rental (~$900/mo) in under five months, though against ~$150/mo API pricing the payback is closer to two years.
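The breakeven against the table above is simple division (figures taken from the table; real costs vary with usage and electricity):

```python
def breakeven_months(hardware_cost: float, monthly_alternative: float) -> float:
    """Months until a one-time hardware purchase beats a recurring cost."""
    return hardware_cost / monthly_alternative

# $4K Mac Studio vs. the recurring options in the table:
print(round(breakeven_months(4000, 900), 1))  # 4.4  (vs. A100 cloud rental)
print(round(breakeven_months(4000, 150), 1))  # 26.7 (vs. GPT-4o API)
```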
Getting Started With Local LLMs on Mac
If you haven't tried local inference yet:
# Option 1: Ollama (easiest)
brew install ollama
ollama run llama3:8b
# Option 2: MLX (Apple's framework, fastest on M-series)
pip install mlx-lm
python -m mlx_lm.generate --model mlx-community/Meta-Llama-3-8B-Instruct-4bit --prompt "Hello"
# Option 3: llama.cpp (most flexible)
brew install llama.cpp
llama-cli -m ./your-model.gguf -p "Hello"  # llama-cli ships with the brew formula
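If you go the Ollama route, it also exposes a local HTTP API on port 11434 (`/api/generate` is the documented endpoint). This sketch only builds the request body, so it runs without a server; the model name and prompt are just examples:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """JSON body for a POST to Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

body = build_generate_request("llama3:8b", "Explain unified memory in one sentence.")
# While `ollama serve` is running, POST `body` to OLLAMA_URL with
# urllib.request or requests to get the completion back as JSON.
print(json.loads(body)["model"])  # llama3:8b
```

This is handy for wiring local inference into your own tools without any SDK dependency.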
Discussion
- Are you running LLMs locally? What hardware and framework?
- Has local inference replaced cloud APIs for any of your use cases?
- Is Apple Silicon becoming a legitimate AI development platform?
- What model size do you actually need for your work? (Most people over-provision.)
I run an 8B model locally for code generation; it handles 90% of what I need, and the other 10% still goes to cloud APIs.