This week, armed with an article on Hugging Face claiming that AirLLM can run 70B models on 4 GB of GPU memory, I figured my M4 MacBook Pro (48 GB RAM) could run a local coding assistant. And since Kilo Code had just discontinued the free xGrog AI, the timing felt like a gift falling from the sky.
The Goal
Create a local xGrog replacement after xGrog became paid on Kilo Code. I wanted to run models like:
- CodeLlama 13B/70B
- DeepSeek-OCR models
- Llama 3.1 variants
Fun fact: I didn't have to type all the commands and Python files by hand; I just asked DeepSeek to do it for me and gave it directions when things failed.
What Actually Worked
✅ Success: AirLLM with Mistral 7B
Using AirLLM's MLX backend, I successfully ran Mistral 7B without quantization. The key was proper PyTorch-to-MLX tensor conversion:
# First, upgrade pip
pip install --upgrade pip
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate
# Install airllm
pip install airllm
# Install PyTorch for Mac (MPS support)
pip install torch torchvision torchaudio
# Optional: for better performance
pip install accelerate
# had to add these (compared to official docs)
pip install mlx-lm
pip install transformers
pip install safetensors
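Before running the script, it's worth a quick sanity check that PyTorch actually sees the Metal (MPS) backend and that MLX imports cleanly. This is my own addition, not part of the AirLLM docs:
# check_backend.py - sanity check for Apple Silicon backends (my own addition)
import torch
import mlx.core as mx
# MPS is PyTorch's Metal backend on Apple Silicon
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
# MLX should report the GPU as its default device on an M-series Mac
print("MLX default device:", mx.default_device())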
# airllm_working_mac.py
from airllm import AutoModel
import mlx.core as mx
print("π AirLLM on Apple Silicon - Working Version")
print("="*60)
# Load model
model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
print("β
Model loaded successfully")
def generate_response(prompt, max_tokens=100):
    """Generate a response using AirLLM on Mac"""
    # Tokenize
    input_tokens = model.tokenizer(
        [prompt],
        return_tensors="pt",
        return_attention_mask=False,
        truncation=True,
        max_length=512,
        padding=False
    )
    # Convert PyTorch → NumPy → MLX
    pt_tensor = input_tokens['input_ids']
    numpy_array = pt_tensor.cpu().numpy()
    mlx_array = mx.array(numpy_array, dtype=mx.int32)
    print(f"📏 Input length: {mlx_array.shape[1]} tokens")
    # Generate - returns a string directly
    response = model.generate(
        mlx_array,
        max_new_tokens=max_tokens,
        use_cache=True,
        temperature=0.7
    )
    # Clean up the response (remove extra end tokens)
    cleaned_response = response.replace('</s>', '').strip()
    return cleaned_response
# Test it
test_prompts = [
    "What is the capital of France?",
    "Write a Python function to calculate factorial",
    "Explain quantum computing in simple terms",
]

for i, prompt in enumerate(test_prompts):
    print(f"\n{'='*60}")
    print(f"TEST {i+1}: {prompt}")
    print(f"{'='*60}")
    response = generate_response(prompt, max_tokens=50)
    print(f"🤖 Response:\n{response}")
    # Save the response to a file
    with open(f"response_{i+1}.txt", "w") as f:
        f.write(f"Prompt: {prompt}\n\nResponse: {response}")

print(f"\n{'='*60}")
print("✅ All tests completed! Responses saved to response_*.txt")
print(f"{'='*60}")
Performance: every question took ~10 minutes, and subsequent runs were no faster.
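For what it's worth, that ~10 minutes is rough wall-clock time; a minimal timing wrapper one could append to the script above (a sketch, not part of my original run) looks like this:
# Append to airllm_working_mac.py - rough wall-clock timing around a single generation (sketch)
import time

def timed_generate(prompt, max_tokens=50):
    start = time.perf_counter()
    response = generate_response(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    print(f"Generated up to {max_tokens} new tokens in {elapsed / 60:.1f} minutes")
    return response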
What about bigger models?
❌ Quantization Woes
The biggest limitation: bitsandbytes doesn't work on Mac. This meant:
- No 4-bit compression for larger models
- 48GB RAM limits us to ~13B models max without quantization
- mlx-community models (pre-quantized) had loading issues
# This fails on Mac - bitsandbytes needs CUDA
model = AutoModel.from_pretrained("Phind-CodeLlama-34B-v2", compression='4bit')
# Error: "Torch not compiled with CUDA enabled"
❌ File Format Frustrations
Downloaded mlx-community/CodeLlama-13b-Python-4bit-MLX (7GB) but hit:
FileNotFoundError: No safetensors found in [path]
The pre-converted MLX models didn't match MLX-LM's expected file structure.
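For reference, the standard mlx-lm loading path, which is roughly where this error surfaced, looks like this (a sketch; load() expects the usual safetensors shards plus config and tokenizer files in the downloaded snapshot):
# load_mlx_model.py - standard mlx-lm loading path for a pre-quantized community model (sketch)
from mlx_lm import load, generate

# load() looks for *.safetensors weight shards plus config/tokenizer files;
# a call like this is what raised the FileNotFoundError for my download
model, tokenizer = load("mlx-community/CodeLlama-13b-Python-4bit-MLX")

response = generate(model, tokenizer, prompt="Write a Python function to calculate factorial", max_tokens=100)
print(response)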
❌ Performance Reality
Even with successful 7B models:
- 8.5 minutes for initial AirLLM layer splitting
- Significant disk space needed for cache (20+ GB; see the quick check after this list)
- Memory pressure even with "small" models
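A quick way to see how much of that disk the Hugging Face cache is eating (my own sketch; AirLLM's split layers also land under the model's cache directory, though the exact layout may vary by version):
# cache_size.py - rough size of the Hugging Face hub cache (sketch)
from pathlib import Path

cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
total_bytes = sum(
    f.stat().st_size
    for f in cache_dir.rglob("*")
    if f.is_file() and not f.is_symlink()  # skip snapshot symlinks to avoid double counting
)
print(f"Hugging Face cache: {total_bytes / 1e9:.1f} GB in {cache_dir}")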
The Architecture Limitation
Apple's MLX framework is promising but immature:
- No working quantization path on this setup (AirLLM's compression relies on bitsandbytes, which is CUDA-only)
- Model ecosystem is spotty
- Documentation gaps for edge cases
Conclusion: Hosted Solutions Still Win (for Now)
The Mac Studio (M3 Ultra, 192 GB RAM) looks like a good silent AI workstation at home. But:
- Same quantization issues apply
- Still limited by MLX ecosystem maturity
- Cost: $7,000+ for hardware vs $20/month cloud
Lessons Learned
- AirLLM has significant startup overhead; a very "smart" model could compensate for the loading time by nailing a complex task in one shot, but there are many bottlenecks before you even get it running
- Apple Silicon needs native quantization tools
- The ecosystem is moving fast but isn't production-ready
- For serious work, cloud solutions still dominate
What else did I try on my Mac?
Ollama with smaller models:
brew install ollama
ollama run codellama:13b-python
It just works: no tensor conversions, no quantization headaches. But it's very slow, of no help on real-world tasks, and the memory consumption was brutal.
(there are articles out there on how to run it, if you're curious)
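The nice part is that Ollama also exposes a local HTTP API (on port 11434 by default), which is what editor integrations generally talk to. A minimal sketch using only the standard library, assuming the Ollama server is running:
# ollama_api.py - minimal call to Ollama's local HTTP API (sketch)
import json
import urllib.request

payload = {
    "model": "codellama:13b-python",
    "prompt": "Write a Python function to calculate factorial",
    "stream": False,  # return a single JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])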
So the biggest barrier to learning hands-on AI model manipulation today is the hardware entry ticket.
Have you tried any alternatives on your side? Tell me in the comments.
