DEV Community

Rashi Dashore


Running Llama Models Locally with Docker

I've been experimenting with running large language models entirely on my own machine, and the setup turned out to be simpler than I expected. Here's exactly what I did to get Llama 3 running locally using Docker, with no cloud API and no data leaving my machine.

Why Local Inference?

The first thing I noticed after switching to local inference was the privacy gain. Every prompt I send stays on my machine. For projects involving sensitive data (internal documents, customer queries, proprietary code), that matters. There's no third-party logging, no rate limits, and no per-token cost.

Beyond privacy, running models locally gives you full control over the model version, the inference parameters, and the runtime environment. Cloud APIs abstract much of that away. Whenever I needed to tweak temperature or context length for a specific task, I could do it directly without navigating a provider dashboard. Local inference also means your application keeps working even when an external API goes down, a real advantage in production workflows.
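As a minimal sketch of that kind of per-task tuning, the helper below builds a request body for Ollama's /api/generate endpoint with sampling options set per request. The name `build_payload` is hypothetical. `temperature` and `num_ctx` (context window, in tokens) are option names from the Ollama API; the default values are illustrative assumptions, not recommendations.

```python
from typing import Any


def build_payload(prompt: str, model: str = "llama3",
                  temperature: float = 0.2, num_ctx: int = 4096) -> dict[str, Any]:
    """Build a JSON body for POST /api/generate with per-request options."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive number of tokens")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object rather than chunks
        "options": {
            "temperature": temperature,  # lower values give more deterministic output
            "num_ctx": num_ctx,          # context window size in tokens
        },
    }
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, so a summarization task and a brainstorming task can each use their own temperature without touching the server config.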

Minimum System Requirements

Before starting, make sure your machine meets these minimums:

  • RAM: 8 GB minimum (16 GB recommended for smooth performance)
  • Disk: At least 10 GB free storage
  • CPU: Modern multi-core processor (x86_64 or ARM64)
  • GPU: Optional but significantly speeds up inference; NVIDIA with CUDA or Apple Silicon with Metal both work
  • OS: Linux, macOS, or Windows with WSL2
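A quick way to check part of this list before pulling several gigabytes of model weights is a small preflight script. This is a sketch using only the Python standard library; `preflight` is a hypothetical helper, and it covers disk space and CPU only, since the standard library has no portable way to read total RAM.

```python
import os
import platform
import shutil


def preflight(path: str = ".", min_free_gb: float = 10.0) -> dict:
    """Report free disk space, CPU count, and architecture for the given path."""
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes -> decimal GB
    arch = platform.machine().lower()
    return {
        "free_disk_gb": round(free_gb, 1),
        "disk_ok": free_gb >= min_free_gb,
        "cpu_count": os.cpu_count(),
        "arch": arch,
        # Common spellings of the two architectures listed above.
        "arch_ok": arch in {"x86_64", "amd64", "arm64", "aarch64"},
    }
```

Calling `preflight()` from the directory where Docker stores its data gives a quick yes/no on the disk and CPU items; RAM and GPU still need to be checked with the operating system's own tools.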

How It Works

Ollama is a lightweight runtime that handles downloading, loading, and serving quantized models over a local HTTP API. Wrapping it in Docker makes the setup portable and isolated: the server runs inside the container, and the downloaded model files live in a named volume, separate from the rest of your system. Docker also means you can bring this up on any machine with a single command, with no manual dependency installs.
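Because everything goes through that HTTP API, checking that the server is reachable takes only a few lines. The sketch below assumes the container from the setup that follows is already running on the default port 11434 and queries Ollama's /api/tags endpoint, which lists locally available models. The helper names are hypothetical, and it uses only the standard library so it runs without extra installs.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default port published by the container


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask the running Ollama server which models it has on disk."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

If `list_local_models()` raises a connection error, the container is not up or the port is not published; an empty list means the server is running but no model has been pulled yet.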

Setup

I used Ollama inside Docker, which packages the model runtime cleanly. I created a docker-compose.yml to make the setup reproducible:

# docker-compose.yml
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:

Then I started the container, pulled the model, and ran Llama 3:

docker compose up -d
docker exec -it ollama ollama pull llama3
docker exec -it ollama ollama run llama3

I added a simple Python client to query it programmatically:

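import requests

# Send one non-streaming generation request to the local Ollama server.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize the key risks in this contract clause: ...",
        "stream": False,  # return a single JSON object instead of chunks
    },
    timeout=120,  # generation can be slow on CPU-only machines
)
response.raise_for_status()  # surface HTTP errors, e.g. if the model was not pulled
print(response.json()["response"])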
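The client above waits for the whole answer. With "stream" set to true, Ollama instead returns newline-delimited JSON objects, each carrying a fragment of text in "response" and a "done" flag on the last one. A minimal sketch of consuming that stream follows; `iter_fragments` is a hypothetical helper that takes any iterable of raw lines, so it can be tried without a running server.

```python
import json
from typing import Iterable, Iterator


def iter_fragments(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final object marks the end of generation
```

With requests, this would be used as `for piece in iter_fragments(resp.iter_lines()): print(piece, end="", flush=True)`, where `resp` comes from `requests.post(..., stream=True)` with `"stream": True` in the JSON body. Tokens then appear as they are generated, which makes longer answers feel more responsive.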

Example Prompts:

  • "Explain this Python stack trace and suggest a fix."
  • "Summarize this 500-word document in three bullet points."

Response latency on the 8B model was 2–4 seconds per query on my machine, fast enough for interactive use. Expect this to vary with hardware and prompt length.
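To measure latency on your own hardware, the non-streaming /api/generate response already includes timing metadata: `total_duration` and `eval_duration` in nanoseconds, and `eval_count` as the number of generated tokens. The sketch below turns those fields into seconds and tokens per second; `generation_stats` is a hypothetical helper.

```python
def generation_stats(result: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into readable figures."""
    ns_per_s = 1e9
    eval_s = result["eval_duration"] / ns_per_s  # time spent generating tokens
    return {
        "total_s": result["total_duration"] / ns_per_s,  # includes load and prompt time
        "tokens": result["eval_count"],
        "tokens_per_s": result["eval_count"] / eval_s if eval_s > 0 else 0.0,
    }
```

Passing `response.json()` from the Python client into `generation_stats` gives a per-request figure, which is more useful for comparing machines or models than a single wall-clock number.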


Resource Quick Reference

| Model | RAM Required | Disk Space |
|---|---|---|
| Llama 3 8B | ~6 GB | ~4.7 GB |
| Llama 3 70B | ~48 GB | ~40 GB |
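The disk figures are roughly what a back-of-the-envelope calculation gives for 4-bit quantized weights. The sketch below assumes an effective rate of about 4.5 bits per weight (an assumption covering the 4-bit values plus their scaling factors) and nominal parameter counts of 8 billion and 70 billion. It ignores file metadata and any tensors stored at higher precision, so it lands slightly under the sizes in the table.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float = 4.5) -> float:
    """Rough size of quantized weights in decimal gigabytes."""
    total_bytes = n_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / 1e9


for label, n_params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    print(f"{label}: ~{estimate_weights_gb(n_params):.1f} GB")
```

RAM needs sit above the weight size because the runtime also holds the context cache and working buffers, which is consistent with the RAM column being larger than the disk column.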

Running Llama locally with Docker took me under 15 minutes to configure, and it's now part of my standard dev environment for any task where keeping data private is non-negotiable.

Have you tried running Llama models locally? How was your experience?
