DEV Community

Alex Spinov


Running LLMs on Apple Silicon Is Getting Serious — Hypura Scheduler (194pts on HN)

A new project just hit Hacker News at 194+ points: Hypura — a storage-tier-aware LLM inference scheduler specifically for Apple Silicon.

This is significant because it addresses the biggest limitation of running LLMs locally on Mac: memory management.

The Problem

Running a 70B parameter model on a MacBook Pro:

| Model | RAM Needed | M3 Max (96GB) | M4 Ultra (192GB) |
|---|---|---|---|
| Llama 3 8B | 8GB | ✅ Fast | ✅ Fast |
| Llama 3 70B | 40GB | ⚠️ Slow (swap) | ✅ Fast |
| Mixtral 8x22B | 88GB | ❌ Won't fit | ⚠️ Tight |
| Llama 3 405B | 200GB+ | ❌ Won't fit | ❌ Won't fit |

Apple's unified memory is great, but when models exceed available RAM, inference falls off a cliff.
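A quick back-of-envelope check makes the numbers above concrete. In this sketch, the 1.2x overhead factor is a rough assumption for KV cache and runtime buffers, not a measured constant:

```python
def model_ram_gb(params_billions: float, bits_per_weight: int = 16,
                 overhead: float = 1.2) -> float:
    """Rough RAM needed to hold a model's weights at a given quantization.

    The 1.2x overhead is a ballpark for KV cache and runtime buffers,
    not a measured constant.
    """
    weight_gb = params_billions * bits_per_weight / 8  # 1B params @ 8-bit ≈ 1 GB
    return weight_gb * overhead

# Llama 3 70B quantized to 4 bits: ~42 GB of weights and buffers,
# before the OS and other apps take their share of unified memory.
print(f"{model_ram_gb(70, bits_per_weight=4):.0f} GB")  # 42 GB
```

The same function explains why quantization matters so much locally: halving bits-per-weight halves the memory footprint, which is often the difference between fitting in unified memory and spilling to swap.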

What Hypura Does

Hypura is a scheduler that's aware of Apple Silicon's storage tiers — it intelligently manages which model layers live in:

  1. Unified memory (fastest)
  2. Compressed memory (middle ground)
  3. SSD swap (slowest, but huge)

This means you can run models larger than your RAM alone would allow, with better performance than naive OS swapping.
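To make the idea concrete, here is a minimal greedy sketch of tier-aware placement. This is not Hypura's actual algorithm (the project's internals aren't described here), and the `Tier` type and capacities are illustrative; it models capacity only, ignoring access order and prefetching:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_gb: float
    bandwidth_gbps: float  # rough relative speed of the tier

def place_layers(layer_sizes_gb, tiers):
    """Greedily assign each layer to the fastest tier with room left.

    Returns a list of (layer_index, tier_name). A real scheduler would
    also weigh access order, prefetching, and compression ratios; this
    sketch models capacity only.
    """
    by_speed = sorted(tiers, key=lambda t: t.bandwidth_gbps, reverse=True)
    free = {t.name: t.capacity_gb for t in by_speed}
    placement = []
    for i, size in enumerate(layer_sizes_gb):
        for tier in by_speed:
            if free[tier.name] >= size:
                free[tier.name] -= size
                placement.append((i, tier.name))
                break
        else:
            raise MemoryError(f"layer {i} ({size} GB) fits in no tier")
    return placement

# Illustrative capacities, not real hardware numbers:
tiers = [
    Tier("unified", capacity_gb=24, bandwidth_gbps=400),
    Tier("compressed", capacity_gb=12, bandwidth_gbps=80),
    Tier("ssd", capacity_gb=200, bandwidth_gbps=7),
]

# 40 one-GB layers: the first 24 land in unified memory, the next 12
# in compressed memory, and the final 4 spill to SSD.
print(place_layers([1.0] * 40, tiers)[24])  # (24, 'compressed')
```

Even this naive version beats blind OS swapping, because the hot early layers never leave the fastest tier; the interesting engineering is in deciding which layers are "hot" as the token stream progresses.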

Why This Matters for Developers

1. Local LLM development gets more practical

If you're building AI-powered tools, local inference means:

  • No API costs
  • No rate limits
  • No data leaving your machine
  • Works offline

2. Apple Silicon is becoming an AI platform

With Arm leaning further into AI-focused CPU designs and Apple steadily expanding its ML tooling:

```shell
# The Apple Silicon ML stack in 2026:
# - MLX (Apple's ML framework)
# - Core ML (on-device inference)
# - Hypura (intelligent scheduling)
# - Metal (GPU compute)
# - Neural Engine (dedicated AI silicon)
```

This is no longer a "NVIDIA or nothing" world for AI development.

3. Cost comparison

| Option | Monthly Cost | Latency |
|---|---|---|
| GPT-4o API (1M tokens/day) | ~$150 | 50-200ms |
| Claude Sonnet API | ~$90 | 50-150ms |
| Mac Studio M4 Ultra (one-time) | $0/mo (after $4K purchase) | 30-100ms |
| Cloud GPU (A100) | ~$900/mo | 20-50ms |

If you're pushing 1M+ tokens/day, local inference on Apple Silicon pays for itself in roughly two to three years at these rates, and faster as your usage grows.
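The break-even math is simple enough to script. The $10/month power figure here is a rough assumption for an always-on Mac Studio, not a measurement:

```python
def breakeven_months(hardware_cost: float, api_cost_per_month: float,
                     power_cost_per_month: float = 10.0) -> float:
    """Months until buying hardware beats paying for a cloud API.

    power_cost_per_month is a rough guess for an always-on Mac Studio.
    """
    monthly_saving = api_cost_per_month - power_cost_per_month
    if monthly_saving <= 0:
        return float("inf")
    return hardware_cost / monthly_saving

# $4,000 Mac Studio vs ~$150/month of GPT-4o API usage:
print(f"{breakeven_months(4000, 150):.0f} months")  # 29 months
```

The model ignores resale value, depreciation, and the fact that the Mac is also your dev machine, all of which shorten the effective payback period.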

Getting Started With Local LLMs on Mac

If you haven't tried local inference yet:

```shell
# Option 1: Ollama (easiest)
brew install ollama
ollama run llama3:8b

# Option 2: MLX (Apple's framework, fastest on M-series)
pip install mlx-lm
python -m mlx_lm.generate --model mlx-community/Meta-Llama-3-8B-Instruct-4bit

# Option 3: llama.cpp (most flexible)
brew install llama.cpp
```
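Once one of these is running, you can talk to it from code. Ollama, for example, exposes a local HTTP API on its default port 11434; this sketch sends a single non-streaming prompt and assumes `ollama serve` is running with the model already pulled:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_payload(prompt: str, model: str = "llama3:8b") -> bytes:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()

def ask_ollama(prompt: str, model: str = "llama3:8b") -> str:
    """Send one prompt to a locally running Ollama server."""
    req = request.Request(OLLAMA_URL, data=build_payload(prompt, model),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires `ollama serve` running and the model pulled:
# print(ask_ollama("Explain unified memory in one sentence."))
```

Because it's plain HTTP on localhost, any language with an HTTP client can drive the model, which is exactly what makes local inference practical for tooling.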

Discussion

  • Are you running LLMs locally? What hardware and framework?
  • Has local inference replaced cloud APIs for any of your use cases?
  • Is Apple Silicon becoming a legitimate AI development platform?
  • What model size do you actually need for your work? (Most people oversize)

I run an 8B model locally for code generation and it handles 90% of what I need. The other 10% still goes to cloud APIs.
