Zied Hamdi
Running AirLLM Locally on Apple Silicon: Not So Good

This week, armed with a Hugging Face article describing how AirLLM can run 70B models on 4GB of GPU memory, I figured my M4 MacBook Pro (48GB RAM) could run a local coding assistant. With Kilo Code discontinuing the free xGrog AI, the timing felt like a gift falling from the sky.

The Goal

Create a local xGrog replacement after xGrog became paid on Kilo Code. I wanted to run models like:

  • CodeLlama 13B/70B
  • DeepSeek-OCR models
  • Llama 3.1 variants

Fun fact: I didn't have to type all the commands and Python files by hand; I just asked DeepSeek to do it for me and gave it directions when things failed.

What Actually Worked

✅ Success: AirLLM with Mistral 7B

Using AirLLM's MLX backend, I successfully ran Mistral 7B without quantization. The key was proper PyTorch-to-MLX tensor conversion:

# First, upgrade pip
pip install --upgrade pip

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install airllm
pip install airllm

# Install PyTorch for Mac (MPS support)
pip install torch torchvision torchaudio

# Optional: for better performance
pip install accelerate

# Had to add these (compared to the official docs)
pip install mlx-lm
pip install transformers
pip install safetensors

# airllm_working_mac.py
from airllm import AutoModel
import mlx.core as mx

print("πŸš€ AirLLM on Apple Silicon - Working Version")
print("="*60)

# Load model
model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
print("βœ… Model loaded successfully")

def generate_response(prompt, max_tokens=100):
    """Generate a response using AirLLM on Mac"""

    # Tokenize
    input_tokens = model.tokenizer(
        [prompt],
        return_tensors="pt",
        return_attention_mask=False,
        truncation=True,
        max_length=512,
        padding=False
    )

    # Convert PyTorch β†’ NumPy β†’ MLX
    pt_tensor = input_tokens['input_ids']
    numpy_array = pt_tensor.cpu().numpy()
    mlx_array = mx.array(numpy_array, dtype=mx.int32)

    print(f"πŸ“ Input length: {mlx_array.shape[1]} tokens")

    # Generate - returns STRING directly
    response = model.generate(
        mlx_array,
        max_new_tokens=max_tokens,
        use_cache=True,
        temperature=0.7
    )

    # Clean up the response (remove extra end tokens)
    cleaned_response = response.replace('</s>', '').strip()
    return cleaned_response

# Test it
test_prompts = [
    "What is the capital of France?",
    "Write a Python function to calculate factorial",
    "Explain quantum computing in simple terms"
]

for i, prompt in enumerate(test_prompts):
    print(f"\n{'='*60}")
    print(f"TEST {i+1}: {prompt}")
    print(f"{'='*60}")

    response = generate_response(prompt, max_tokens=50)
    print(f"πŸ€– Response:\n{response}")

    # Save to file
    with open(f"response_{i+1}.txt", "w") as f:
        f.write(f"Prompt: {prompt}\n\nResponse: {response}")

print(f"\n{'='*60}")
print("βœ… All tests completed! Responses saved to response_*.txt")
print(f"{'='*60}")

Performance: every question took ~10 minutes, and subsequent runs were not faster.

AirLLM takes time on each question
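
If you want to put your own numbers on that, a minimal timing sketch is enough; it just reuses generate_response and test_prompts from the script above and wraps each call in time.perf_counter.

# Minimal latency measurement (reuses generate_response and test_prompts from the script above)
import time

for i, prompt in enumerate(test_prompts):
    start = time.perf_counter()
    response = generate_response(prompt, max_tokens=50)
    elapsed = time.perf_counter() - start
    print(f"⏱️  Prompt {i+1}: {elapsed / 60:.1f} min wall-clock")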

What about bigger models?

❌ Quantization Woes

The biggest limitation: bitsandbytes doesn't work on Mac. This meant:

  1. No 4-bit compression for larger models
  2. 48GB RAM limits us to ~13B models max without quantization
  3. mlx-community models (pre-quantized) had loading issues
# This fails on Mac - bitsandbytes needs CUDA
model = AutoModel.from_pretrained("Phind-CodeLlama-34B-v2", compression='4bit')
# Error: "Torch not compiled with CUDA enabled"
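
For completeness, here is the quick sanity check that shows what PyTorch actually sees on an M-series Mac: MPS is there, CUDA is not, which is exactly why the bitsandbytes path falls over.

# check_backends.py - confirm which accelerator PyTorch exposes on this machine
import torch

print(f"CUDA available: {torch.cuda.is_available()}")         # False on Apple Silicon
print(f"MPS available:  {torch.backends.mps.is_available()}")  # True on M-series Macs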

❌ File Format Frustrations

Downloaded mlx-community/CodeLlama-13b-Python-4bit-MLX (7GB) but hit:

FileNotFoundError: No safetensors found in [path]

The pre-converted MLX models didn't match MLX-LM's expected file structure.
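
If you hit the same wall, a quick way to see what actually landed on disk is to list the weight files in the download. The path below is the Hugging Face default cache location; adjust it to wherever the model ended up on your machine.

# inspect_snapshot.py - list which weight files a downloaded model actually contains
from pathlib import Path

# Default Hugging Face cache location; adjust if your model was downloaded elsewhere
model_dir = Path.home() / ".cache" / "huggingface" / "hub"

for pattern in ("*.safetensors", "*.bin", "*.npz"):
    matches = list(model_dir.rglob(pattern))
    print(f"{pattern}: {len(matches)} file(s)")
    for m in matches[:5]:
        print(f"  {m}")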

❌ Performance Reality

Even with successful 7B models:

  • 8.5 minutes for initial AirLLM layer splitting
  • Significant disk space needed for the cache (20+ GB; see the sketch after this list)
  • Memory pressure even with "small" models
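
To see where that disk space goes, a small directory walk is enough. The path below is a placeholder: AirLLM's layer cache and the Hugging Face cache live in separate directories, so point it at whichever one you want to measure.

# cache_size.py - measure how much disk space a cache directory uses
from pathlib import Path

# Placeholder path: point this at the AirLLM or Hugging Face cache you want to measure
cache_dir = Path.home() / ".cache" / "huggingface"

total_bytes = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file())
print(f"{cache_dir}: {total_bytes / 1e9:.1f} GB")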

The Architecture Limitation

Apple's MLX framework is promising but immature:

  • No quantization path that works out of the box here (AirLLM's compression relies on bitsandbytes, which needs CUDA)
  • Model ecosystem is spotty
  • Documentation gaps for edge cases
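
For contrast, the "native" route the ecosystem is converging on is mlx-lm's own load/generate helpers. This is only a sketch of that documented API: the repo name is just an example of a pre-converted 4-bit model, and the exact keyword arguments have moved around between mlx-lm releases.

# mlx_lm_sketch.py - loading a pre-converted 4-bit model directly through mlx-lm
from mlx_lm import load, generate

# Example repo; any mlx-community model whose files match mlx-lm's expectations should work
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Write a Python function to calculate factorial",
    max_tokens=100,
)
print(response)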

Conclusion: Hosted Solutions Still Win (For Now)

The Mac Studio (M3 Ultra, 192GB RAM) looks like a tempting silent AI workstation for home use. But:

  • Same quantization issues apply
  • Still limited by MLX ecosystem maturity
  • Cost: $7,000+ for hardware vs $20/month cloud

Lessons Learned

  1. AirLLM has significant per-run overhead; a very "smart" model could compensate for the loading time by executing a complex task, but there are many bottlenecks just to get it running
  2. Apple Silicon needs native quantization tools
  3. The ecosystem is moving fast but isn't production-ready
  4. For serious work, cloud solutions still dominate

What else did I try on my Mac?

Ollama with smaller models:

brew install ollama
ollama run codellama:13b-python

It just works: no tensor conversions, no quantization headaches. But it is very slow, too slow to be of any help on real-world tasks, and memory consumption went through the roof.
(There are articles out there on how to run it, if you're curious.)
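
If you do go the Ollama route, one nice thing is that it exposes a local REST API, so wiring it into your own scripts takes a few lines of standard library code. The sketch below assumes the default port 11434 and that codellama:13b-python has already been pulled.

# ollama_query.py - query a locally running Ollama server over its REST API
import json
import urllib.request

payload = {
    "model": "codellama:13b-python",  # must already be pulled (e.g. via `ollama run`)
    "prompt": "Write a Python function to calculate factorial",
    "stream": False,                  # return a single JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

print(result["response"])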

So the biggest barrier to learning hands-on AI model manipulation today is the hardware entry ticket.
Have you tried any alternatives on your side? Tell me in the comments.
