Zied Hamdi
Running AirLLM Locally on Apple Silicon: Not So Good

This week, armed with a Hugging Face article describing how AirLLM can run 70B models on 4GB of GPU memory, I figured my M4 MacBook Pro (48GB RAM) could run a local coding assistant. With Kilo Code discontinuing the free xGrog AI, the timing felt like a gift falling from the sky.

The Goal

Create a local xGrog replacement after xGrog became paid on Kilo Code. I wanted to run models like:

  • CodeLlama 13B/70B
  • DeepSeek-OCR models
  • Llama 3.1 variants

Fun fact: I didn't have to type all the commands and Python files by hand; I just asked DeepSeek to do it for me and gave it directions when things failed.

What Actually Worked

✅ Success: AirLLM with Mistral 7B

Using AirLLM's MLX backend, I successfully ran Mistral 7B without quantization. The key was proper PyTorch-to-MLX tensor conversion:

# First, upgrade pip
pip install --upgrade pip

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install airllm
pip install airllm

# Install PyTorch for Mac (MPS support)
pip install torch torchvision torchaudio

# Optional: for better performance
pip install accelerate

# Had to add these (compared to the official docs)
pip install mlx-lm
pip install transformers
pip install safetensors

# airllm_working_mac.py
from airllm import AutoModel
import mlx.core as mx

print("πŸš€ AirLLM on Apple Silicon - Working Version")
print("="*60)

# Load model
model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
print("βœ… Model loaded successfully")

def generate_response(prompt, max_tokens=100):
    """Generate a response using AirLLM on Mac"""

    # Tokenize
    input_tokens = model.tokenizer(
        [prompt],
        return_tensors="pt",
        return_attention_mask=False,
        truncation=True,
        max_length=512,
        padding=False
    )

    # Convert PyTorch β†’ NumPy β†’ MLX
    pt_tensor = input_tokens['input_ids']
    numpy_array = pt_tensor.cpu().numpy()
    mlx_array = mx.array(numpy_array, dtype=mx.int32)

    print(f"πŸ“ Input length: {mlx_array.shape[1]} tokens")

    # Generate - returns STRING directly
    response = model.generate(
        mlx_array,
        max_new_tokens=max_tokens,
        use_cache=True,
        temperature=0.7
    )

    # Clean up the response (remove extra end tokens)
    cleaned_response = response.replace('</s>', '').strip()
    return cleaned_response

# Test it
test_prompts = [
    "What is the capital of France?",
    "Write a Python function to calculate factorial",
    "Explain quantum computing in simple terms"
]

for i, prompt in enumerate(test_prompts):
    print(f"\n{'='*60}")
    print(f"TEST {i+1}: {prompt}")
    print(f"{'='*60}")

    response = generate_response(prompt, max_tokens=50)
    print(f"πŸ€– Response:\n{response}")

    # Save to file
    with open(f"response_{i+1}.txt", "w") as f:
        f.write(f"Prompt: {prompt}\n\nResponse: {response}")

print(f"\n{'='*60}")
print("βœ… All tests completed! Responses saved to response_*.txt")
print(f"{'='*60}")

Performance: every question took ~10 minutes, and subsequent runs were not faster.

AirLLM takes time on each question
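
If you want to put your own numbers on that, a minimal timing sketch is enough; it just reuses generate_response and test_prompts from the script above and wraps each call in time.perf_counter.

# Minimal latency measurement (reuses generate_response and test_prompts from the script above)
import time

for i, prompt in enumerate(test_prompts):
    start = time.perf_counter()
    response = generate_response(prompt, max_tokens=50)
    elapsed = time.perf_counter() - start
    print(f"⏱️  Prompt {i+1}: {elapsed / 60:.1f} min wall-clock")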

What about bigger models?

❌ Quantization Woes

The biggest limitation: bitsandbytes doesn't work on Mac. This meant:

  1. No 4-bit compression for larger models
  2. 48GB RAM limits us to ~13B models max without quantization
  3. mlx-community models (pre-quantized) had loading issues
# This fails on Mac - bitsandbytes needs CUDA
model = AutoModel.from_pretrained("Phind-CodeLlama-34B-v2", compression='4bit')
# Error: "Torch not compiled with CUDA enabled"
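
For completeness, here is the quick sanity check that shows what PyTorch actually sees on an M-series Mac: MPS is there, CUDA is not, which is exactly why the bitsandbytes path falls over.

# check_backends.py - confirm which accelerator PyTorch exposes on this machine
import torch

print(f"CUDA available: {torch.cuda.is_available()}")         # False on Apple Silicon
print(f"MPS available:  {torch.backends.mps.is_available()}")  # True on M-series Macs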

❌ File Format Frustrations

Downloaded mlx-community/CodeLlama-13b-Python-4bit-MLX (7GB) but hit:

FileNotFoundError: No safetensors found in [path]

The pre-converted MLX models didn't match MLX-LM's expected file structure.
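
If you hit the same wall, a quick way to see what actually landed on disk is to list the weight files in the download. The path below is the Hugging Face default cache location; adjust it to wherever the model ended up on your machine.

# inspect_snapshot.py - list which weight files a downloaded model actually contains
from pathlib import Path

# Default Hugging Face cache location; adjust if your model was downloaded elsewhere
model_dir = Path.home() / ".cache" / "huggingface" / "hub"

for pattern in ("*.safetensors", "*.bin", "*.npz"):
    matches = list(model_dir.rglob(pattern))
    print(f"{pattern}: {len(matches)} file(s)")
    for m in matches[:5]:
        print(f"  {m}")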

❌ Performance Reality

Even with successful 7B models:

  • 8.5 minutes for initial AirLLM layer splitting
  • Significant disk space needed for the cache (20+ GB; see the sketch after this list)
  • Memory pressure even with "small" models
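
To see where that disk space goes, a small directory walk is enough. The path below is a placeholder: AirLLM's layer cache and the Hugging Face cache live in separate directories, so point it at whichever one you want to measure.

# cache_size.py - measure how much disk space a cache directory uses
from pathlib import Path

# Placeholder path: point this at the AirLLM or Hugging Face cache you want to measure
cache_dir = Path.home() / ".cache" / "huggingface"

total_bytes = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file())
print(f"{cache_dir}: {total_bytes / 1e9:.1f} GB")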

The Architecture Limitation

Apple's MLX framework is promising but immature:

  • No quantization path that works out of the box here (AirLLM's compression relies on bitsandbytes, which needs CUDA)
  • Model ecosystem is spotty
  • Documentation gaps for edge cases
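
For contrast, the "native" route the ecosystem is converging on is mlx-lm's own load/generate helpers. This is only a sketch of that documented API: the repo name is just an example of a pre-converted 4-bit model, and the exact keyword arguments have moved around between mlx-lm releases.

# mlx_lm_sketch.py - loading a pre-converted 4-bit model directly through mlx-lm
from mlx_lm import load, generate

# Example repo; any mlx-community model whose files match mlx-lm's expectations should work
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Write a Python function to calculate factorial",
    max_tokens=100,
)
print(response)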

Conclusion: Hosted Solutions Still Win (For Now)

The Mac Studio (M3 Ultra, 192GB RAM) looks like a tempting silent AI workstation for home use. But:

  • Same quantization issues apply
  • Still limited by MLX ecosystem maturity
  • Cost: $7,000+ for hardware vs $20/month cloud

Lessons Learned

  1. AirLLM has significant per-run overhead; a very "smart" model could compensate for the loading time by executing a complex task, but there are many bottlenecks just to get it running
  2. Apple Silicon needs native quantization tools
  3. The ecosystem is moving fast but isn't production-ready
  4. For serious work, cloud solutions still dominate

What else did I try on my Mac?

Ollama with smaller models:

brew install ollama
ollama run codellama:13b-python

It just works: no tensor conversions, no quantization headaches. But it is very slow, too slow to be of any help on real-world tasks, and memory consumption went through the roof.
(There are articles out there on how to run it, if you're curious.)
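
If you do go the Ollama route, one nice thing is that it exposes a local REST API, so wiring it into your own scripts takes a few lines of standard library code. The sketch below assumes the default port 11434 and that codellama:13b-python has already been pulled.

# ollama_query.py - query a locally running Ollama server over its REST API
import json
import urllib.request

payload = {
    "model": "codellama:13b-python",  # must already be pulled (e.g. via `ollama run`)
    "prompt": "Write a Python function to calculate factorial",
    "stream": False,                  # return a single JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

print(result["response"])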

So the biggest barrier to learning hands-on AI model manipulation today is the hardware entry ticket.
Have you tried any alternatives on your side? Tell me in the comments.
