Someone on Reddit recently posted a photo of an LLM running on a 1998 iMac G3 with 32 MB of RAM. My first reaction was "no way." My second reaction was "okay, but how?"
That question sent me down a rabbit hole of model quantization, tiny architectures, and just how far you can push inference on absurdly constrained hardware. Whether you're trying to run a model on a Raspberry Pi, an old laptop, or just want to understand the actual floor for LLM inference, here's what I learned.
## The Problem: LLMs Are Memory Hogs
The typical advice for running LLMs locally assumes you have a modern GPU with 8+ GB of VRAM, or at minimum a machine with 16 GB of system RAM. That's fine if you're running Llama 3 or Mistral on your M-series MacBook. But what if you're working with something far more constrained?
Maybe you want to run inference on an edge device. Maybe you're building for embedded systems. Or maybe you just want to see how small you can go for the sheer fun of it. The blocker is always the same: model weights don't fit in memory.
A 7B parameter model in FP16 needs roughly 14 GB just for the weights. That's before you account for the KV cache, activations, and the runtime itself. On a machine with 32 MB of RAM, you're off by nearly three orders of magnitude.
So how do you bridge that gap?
## Step 1: Pick a Truly Tiny Model
Forget 7B. Forget 3B. You need to go much smaller. A few models worth knowing at the small end of the spectrum:
- SmolLM (135M parameters) — Hugging Face's compact model family
- TinyLlama (1.1B) — still too large for extreme constraints, but a good mid-ground
- GPT-2 Small (124M) — the OG small transformer
- TinyStories models (~30M and under) — trained specifically to generate coherent short stories
For truly extreme environments, you're looking at models in the 15M-135M range. The quality won't blow your mind, but coherent text generation is absolutely possible.
```python
# Quick check: how much RAM does a model actually need?

def estimate_memory_bytes(param_count, bits_per_param=16):
    """Estimate raw weight size — doesn't include runtime overhead."""
    bytes_per_param = bits_per_param / 8
    return param_count * bytes_per_param

# GPT-2 Small at full precision
fp16_size = estimate_memory_bytes(124_000_000, bits_per_param=16)
print(f"FP16: {fp16_size / 1e6:.0f} MB")  # ~248 MB — still too big

# Same model quantized to 4-bit
q4_size = estimate_memory_bytes(124_000_000, bits_per_param=4)
print(f"Q4: {q4_size / 1e6:.0f} MB")  # ~62 MB — getting closer

# Quantized to 2-bit
q2_size = estimate_memory_bytes(124_000_000, bits_per_param=2)
print(f"Q2: {q2_size / 1e6:.0f} MB")  # ~31 MB — now we're talking
```
That math is the key insight. A 124M parameter model at 2-bit quantization fits in roughly 31 MB. Tight, but physically possible on a 32 MB machine if you're clever about the runtime.
## Step 2: Quantize Aggressively
Quantization is the process of reducing the precision of model weights. Instead of storing each weight as a 16-bit or 32-bit float, you represent it with fewer bits. The GGUF format (used by llama.cpp) supports several quantization levels:
| Quant Type | Bits per Weight | Quality Impact |
|---|---|---|
| F16 | 16 | None |
| Q8_0 | 8 | Minimal |
| Q4_0 | 4 | Noticeable |
| Q2_K | 2-3 | Significant |
| IQ2_XXS | ~2 | Heavy |
The IQ2_XXS and Q2_K quant types from llama.cpp are where extreme compression lives. You will lose quality. The model will hallucinate more, repeat itself, and occasionally produce gibberish. But it will run.
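To build intuition for what that bit reduction actually does, here's a toy sketch of block-wise quantization in Python. It's loosely modeled on the scale-per-block idea behind Q4_0, not the actual GGUF bit layout, so treat it as an illustration rather than a reference implementation:

```python
# Toy block-wise 4-bit quantization: one float scale per block of weights,
# each weight stored as a small signed integer. NOT the real GGUF format.
import numpy as np

def quantize_block_q4(weights):
    """Map a block of floats to signed 4-bit ints plus one scale factor."""
    scale = np.max(np.abs(weights)) / 7  # 4-bit signed range is -8..7
    if scale == 0:
        return np.zeros_like(weights, dtype=np.int8), 0.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate floats from the quantized block."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
block = rng.normal(0, 0.02, 32).astype(np.float32)  # a 32-weight block
q, scale = quantize_block_q4(block)
error = np.abs(block - dequantize_block(q, scale)).mean()
print(f"mean abs reconstruction error: {error:.5f}")
```

The quality loss in the table above is exactly this rounding error, accumulated across hundreds of millions of weights: fewer bits per weight means a coarser grid to snap each value to.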
```sh
# Convert and quantize a model using llama.cpp
# First, clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Convert a Hugging Face model to GGUF
python convert_hf_to_gguf.py /path/to/smolLM-135M/ \
    --outfile smolLM-135M-f16.gguf

# Quantize to Q2_K — aggressive but tiny
./llama-quantize smolLM-135M-f16.gguf \
    smolLM-135M-q2_k.gguf Q2_K

# Check the final size
ls -lh smolLM-135M-q2_k.gguf
# Expect something in the 40-60 MB range for 135M params
```
For the absolute smallest footprint, you'd want to start with a sub-100M parameter model and quantize to IQ2_XXS. That can get you into the 20-30 MB range for the weights alone.
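A quick sanity check on that range (roughly 2.06 bits per weight is the figure llama.cpp reports for IQ2_XXS once per-block scales are counted; treat it as a ballpark, and the 80M parameter count below is just an example size):

```python
# Back-of-envelope: a sub-100M model at ~2-bit quantization.
params = 80_000_000            # example sub-100M model
bits_per_weight = 2.06         # approximate IQ2_XXS average, scales included
size_mb = params * bits_per_weight / 8 / 1e6
print(f"~{size_mb:.0f} MB for the weights alone")
```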
## Step 3: Minimize the Runtime
The model weights are only half the battle. You also need an inference engine that doesn't eat all your remaining memory. This is where llama.cpp shines — it's written in C/C++ with minimal dependencies and has been ported to an absurd number of platforms.
Key flags for memory-constrained inference:
```sh
# Run with minimal memory allocation:
#   -c 64      tiny context window -> less KV cache memory
#   -b 1       batch size of 1 -> minimum memory for processing
#   -t 1       single thread -> less stack/scheduling overhead
#   -n 50      generate only 50 tokens
#   --no-mmap  disable memory mapping if it's causing issues
./llama-cli -m smolLM-135M-q2_k.gguf \
    -c 64 -b 1 -t 1 -n 50 --no-mmap \
    -p "Once upon a time"
```
The context window (-c) is critical. Each token in the context requires memory for the key-value cache. On a machine with virtually no RAM, you might need to set this as low as 32 or 64 tokens. That means the model can barely "remember" a sentence or two, but it'll still generate text token by token.
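To see why the context window matters so much, here's a rough KV-cache sizing sketch. The architecture numbers (30 layers, 3 KV heads, head dimension 64) are loosely based on SmolLM-135M's published config; substitute your model's actual values before trusting the output:

```python
# Rough KV cache sizing: each token in the context stores one key vector
# and one value vector per layer (hence the factor of 2 below).
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size; bytes_per_elem=2 assumes an FP16 cache."""
    return 2 * n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 135M-class model: 30 layers, 3 KV heads, head_dim 64
for n_ctx in (32, 64, 512, 2048):
    kb = kv_cache_bytes(n_ctx, n_layers=30, n_kv_heads=3, head_dim=64) / 1024
    print(f"context {n_ctx:>4}: {kb:>8.0f} KB")
```

The cache grows linearly with context length, which is why dropping from a 2048-token to a 64-token window cuts the cache by 32x, and why it's the first knob to turn on a memory-starved machine.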
## Step 4: Dealing with Ancient Architectures
If you're actually targeting old hardware like a PowerPC iMac G3, you've got additional hurdles:
- No SSE/AVX/NEON: Modern SIMD instructions don't exist. Everything is scalar math, which means inference is glacially slow. Think seconds per token, possibly minutes.
- Cross-compilation: You'll likely need to cross-compile llama.cpp on a modern machine targeting the old architecture. GCC still supports PowerPC, so this is doable.
- Memory alignment: Older systems can be picky about aligned memory access. You may need to patch the inference code.
- Swap as RAM: With 32 MB of physical RAM, the OS itself takes a chunk. You'll almost certainly be swapping to disk, which on a 1998-era hard drive means painfully slow page faults.
Will the output be good? No. Will it be fast? Absolutely not. But "technically running" is still running.
## The Practical Takeaway
You probably aren't trying to run inference on a 26-year-old computer. But the techniques here — aggressive quantization, tiny models, minimal context windows — are directly applicable to real-world edge deployment.
Things I'd actually use this knowledge for:
- Raspberry Pi inference — A Pi Zero 2 W has 512 MB of RAM. A Q4-quantized SmolLM-135M runs comfortably there.
- IoT and embedded applications — Simple text classification or short-form generation on embedded Linux boards with 64-256 MB of RAM.
- Offline/air-gapped systems — When you can't call an API and need local inference on whatever hardware is available.
- Cost reduction — Running smaller quantized models on cheaper instances instead of GPU compute.
## Prevention: Know Your Memory Budget Up Front
Before you pick a model for any constrained environment, do the math:
- Count available RAM after the OS and your application take their share
- Estimate weight size at your target quantization level (params × bits ÷ 8)
- Add 20-40% overhead for the runtime, KV cache, and activations
- If it doesn't fit, go smaller on the model or more aggressive on quantization — there is no magic trick that avoids this arithmetic
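The checklist above can be sketched as a quick budget function. The 30% overhead factor is just a midpoint of the 20-40% rule of thumb, and the Pi Zero 2 W numbers in the example are illustrative, not measured:

```python
# Budget check: does a model fit on a given machine at a given quant level?
def fits_in_budget(total_ram_mb, os_and_app_mb, params, bits_per_param,
                   overhead=0.30):
    """Return (fits?, MB needed, MB available). overhead covers the
    runtime, KV cache, and activations, per the 20-40% rule of thumb."""
    available = total_ram_mb - os_and_app_mb
    weights_mb = params * bits_per_param / 8 / 1e6
    needed = weights_mb * (1 + overhead)
    return needed <= available, needed, available

ok, needed, available = fits_in_budget(
    total_ram_mb=512, os_and_app_mb=150,   # e.g. a Pi Zero 2 W, rough guess
    params=135_000_000, bits_per_param=4)  # SmolLM-135M-class model at Q4
print(f"need ~{needed:.0f} MB of {available} MB free: "
      f"{'fits' if ok else 'too big'}")
```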
The exciting part is that the floor keeps dropping. Two years ago, running any coherent language model under 100 MB felt impossible. Now there are purpose-built tiny models that produce surprisingly readable output in under 50 MB.
Someone got an LLM running on a machine from 1998. The output was probably terrible and it probably ran at one token every few seconds. But the fact that it's possible at all tells you something about where this technology is headed — and it's not just toward bigger models.