Introduction
Running large language models locally can feel like trying to power a cathedral with a single AA battery—especially on an 8 GB Mac M1. Fortunately, TinyLlama (1.1B parameters, 4-bit quantized) and the llama.cpp Docker “server” make it dead simple. In this guide, you’ll learn how to:
- Download the TinyLlama Q4_0 model
- Pull the ARM64 llama.cpp server image
- Mount & run TinyLlama inside Docker
- Send your first prompt
Step 1: Download the Quantized Model
First, grab the 0.6 GB quantized weights from Hugging Face and save them into ~/models:
huggingface-cli download \
  TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  --include '*Q4_0.gguf' \
  --local-dir ~/models \
  --local-dir-use-symlinks False
If you don't have huggingface-cli, install it with:
# If you’re using Python3’s pip:
pip3 install --upgrade huggingface-hub
# Or, with the optional CLI extras pulled in explicitly:
pip3 install --upgrade 'huggingface_hub[cli]'
Once installed, verify with:
huggingface-cli --help
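With the CLI in place, it's worth confirming the GGUF file actually landed in ~/models; the roughly 0.6 GB size is a quick sanity check. The glob below simply matches the Q4_0 file downloaded above:
# Expect a single file of roughly 0.6 GB
ls -lh ~/models/*Q4_0.gguf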
Step 2: Pull the Docker Image
Fetch the ARM64-native llama.cpp server—no emulation required:
docker pull ghcr.io/ggerganov/llama.cpp:server-b4646@sha256:645767ffdc357b440d688f61bd752808a339f08dd022cc19d552f53b2c612853
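If you want to be sure Docker grabbed the ARM64 variant rather than an emulated one, inspecting the image's reported architecture is an optional sanity check (referencing the image by the same digest used above):
# Should print "arm64" on an M1
docker image inspect \
  --format '{{.Architecture}}' \
  ghcr.io/ggerganov/llama.cpp@sha256:645767ffdc357b440d688f61bd752808a339f08dd022cc19d552f53b2c612853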
Step 3: Run the llama.cpp Server
Assuming you’ve already placed the quantized TinyLlama model at ~/models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf, launch the container in the foreground:
docker run --rm -it \
  --name tinyllama \
  --platform=linux/arm64/v8 \
  -v ~/models:/models \
  -p 8000:8000 \
  ghcr.io/ggerganov/llama.cpp:server-b4646@sha256:645767ffdc357b440d688f61bd752808a339f08dd022cc19d552f53b2c612853 \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  -n 512
- --rm -it keeps the container clean and interactive
- --platform=linux/arm64/v8 forces the native ARM64 build on the M1
- -n 512 caps responses at 512 tokens
You should see:
server listening at http://0.0.0.0:8000
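Before moving on, you can poke the server from another terminal. Builds of the llama.cpp server around this version expose a /health endpoint; if yours doesn't, any of the /v1/* routes used below will also confirm it's up:
# Expect an OK/healthy response once the model has finished loading
curl http://localhost:8000/health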
Step 4: Query the Model
In a second terminal, send an instruction-style prompt to /v1/completions:
curl http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tinyllama-1.1b-chat-v1.0",
    "prompt": "### Instruction:\nExplain OOP programming simply.\n\n### Response:",
    "max_tokens": 128
  }'
You’ll receive a JSON payload with your answer under choices[0].text.
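If you have jq installed, you can pull just the generated text out of that JSON. Recent llama.cpp server builds also serve an OpenAI-style /v1/chat/completions route that applies the model's chat template for you; the sketch below assumes that route is enabled, and the prompts and token limits are only examples:
# Extract only the completion text (requires jq)
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tinyllama-1.1b-chat-v1.0",
    "prompt": "### Instruction:\nName three uses of Docker.\n\n### Response:",
    "max_tokens": 64
  }' | jq -r '.choices[0].text'

# Chat-style request against the OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tinyllama-1.1b-chat-v1.0",
    "messages": [{"role": "user", "content": "Explain OOP programming simply."}],
    "max_tokens": 128
  }' | jq -r '.choices[0].message.content'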
Troubleshooting Tips
- Blank or gibberish responses? Wrap your prompt in the ### Instruction: … ### Response: template.
- Out of memory? TinyLlama Q4_0 uses ~0.6 GiB in-container. If you see OOMs on larger models, either bump Docker’s memory in Preferences → Resources → Memory or stick to this tiny variant.
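To see what the container is actually using, docker stats gives a one-shot snapshot (this assumes the container name tinyllama from Step 3):
# One-shot memory/CPU snapshot for the running container
docker stats --no-stream tinyllama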
Conclusion
You’ve just transformed your “potato computer” into a local LLM server! With a single Docker command and a quantized TinyLlama model, you’re free to experiment with chatbots, integrations in Node.js/Next.js, or offline AI demos—no cloud GPUs required. Happy hacking! 🚀🥔
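When you’re done experimenting, Ctrl-C in the server terminal stops the container, and thanks to --rm it cleans itself up. From another shell you can also stop it by name:
# Stops (and, because of --rm, removes) the container started in Step 3
docker stop tinyllama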