Darwin Manalo

How to Run a Tiny LLM in a Potato Computer

Introduction

Running large language models locally can feel like trying to power a cathedral with a single AA battery, especially on an 8 GB Mac M1. Fortunately, TinyLlama (1.1B parameters, 4-bit quantized) and the llama.cpp server Docker image make it dead simple. In this guide, you’ll learn how to:

  1. Download the TinyLlama Q4_0 model
  2. Pull the ARM64 llama.cpp server image
  3. Mount & run TinyLlama inside Docker
  4. Send your first prompt

Step 1: Download the Quantized Model

First, grab the 0.6 GB quantized weights from Hugging Face and save them into ~/models:

huggingface-cli download \
  TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  --include '*Q4_0.gguf' \
  --local-dir ~/models \
  --local-dir-use-symlinks False

If you don't have huggingface-cli installed yet, install it with pip:

# The CLI ships with the huggingface-hub package:
pip3 install --upgrade huggingface-hub

# Or, to include the optional CLI extras:
pip3 install --upgrade "huggingface_hub[cli]"

Once installed, verify with:

huggingface-cli --help
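
Once the model download finishes, it’s worth a quick sanity check that the ~0.6 GB GGUF file actually landed in ~/models (the exact filename can vary slightly between repo revisions, so the glob below is just a convenience):

# List any Q4_0 GGUF files in the models folder along with their sizes
ls -lh ~/models/*Q4_0.gguf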

Step 2: Pull the Docker Image

Fetch the ARM64-native llama.cpp server—no emulation required:

docker pull ghcr.io/ggerganov/llama.cpp:server-b4646@sha256:645767ffdc357b440d688f61bd752808a339f08dd022cc19d552f53b2c612853
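
To confirm the pull worked, list the local copies of the image (for a digest-pinned pull like this one, the TAG column may show up as <none>, which is expected):

docker images ghcr.io/ggerganov/llama.cpp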

Step 3: Run the llama.cpp Server

Assuming you’ve already placed the quantized TinyLlama model at ~/models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf, launch the container in the foreground:

docker run --rm -it \
  --name tinyllama \
  --platform=linux/arm64/v8 \
  -v ~/models:/models \
  -p 8000:8000 \
  ghcr.io/ggerganov/llama.cpp:server-b4646@sha256:645767ffdc357b440d688f61bd752808a339f08dd022cc19d552f53b2c612853 \
    -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    -n 512

  • --rm -it removes the container on exit and keeps the session interactive

  • --platform=linux/arm64/v8 forces the native ARM64 build on the M1

  • -n 512 caps each response at 512 tokens

You should see:
server listening at http://0.0.0.0:8000
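
Before moving on, you can poke the server from another terminal. Recent llama.cpp server builds expose a /health endpoint that returns a small JSON status once the model has finished loading (if your build predates it, just skip this check):

# Should report an "ok" status once the model is loaded and ready
curl http://localhost:8000/health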

Step 4: Query the Model

In a second terminal, send an instruction-style prompt to /v1/completions:

curl http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tinyllama-1.1b-chat-v1.0",
    "prompt": "### Instruction:\nExplain OOP programming simply.\n\n### Response:",
    "max_tokens": 128
  }'

You’ll receive a JSON payload with your answer under choices[0].text.
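
If you have jq installed, you can pipe the same request through it to print just the generated text instead of eyeballing the raw JSON:

curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tinyllama-1.1b-chat-v1.0",
    "prompt": "### Instruction:\nExplain object-oriented programming simply.\n\n### Response:",
    "max_tokens": 128
  }' | jq -r '.choices[0].text'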

Troubleshooting Tips

  • Blank or gibberish responses?
    Wrap your prompt in the ### Instruction:…### Response: template.

  • Out-of-memory?
    TinyLlama Q4_0 uses ~0.6 GiB in-container. If you see OOMs on larger models, either bump Docker’s memory in Preferences → Resources → Memory or stick to this tiny variant.
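
To see what the container actually consumes while it’s running, take a one-shot docker stats snapshot (the name tinyllama matches the --name flag from Step 3):

# Print a single snapshot of the container’s memory and CPU usage
docker stats tinyllama --no-stream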

Conclusion

You’ve just transformed your “potato computer” into a local LLM server! With a single Docker command and a quantized TinyLlama model, you’re free to experiment with chatbots, integrations in Node.js/Next.js, or offline AI demos—no cloud GPUs required. Happy hacking! 🚀🥔

Top comments (1)

Nube Colectiva

Thanks for the article. How good is Tiny compared to a lightweight version of Deepseek?
