Aditya Gupta
Running Quantized LLMs Locally: Unlocking Docker Model Runner's Potential

Have you ever dreamed of running powerful Large Language Models (LLMs) on your own machine, free from cloud service complexity and major infrastructure costs? Docker Model Runner is an innovative solution for running quantized LLMs locally. Get set to leverage AI capabilities directly from your desktop, with no GPU cluster required and no billing notifications to worry about.

Why Local Quantized LLMs?

As generative AI reshapes every layer of the technology stack, developers, startups, researchers, and hobbyists are wiring AI into internal tools and workflows, not just agents and chatbots. However, most cloud-based offerings:

  • Bill by token or minute, which quickly adds up
  • Require API keys and impose rate limits
  • Send confidential data to outside service providers


Quantized local LLMs sidestep these challenges. Quantization compresses a model by storing its weights at lower precision, for example int4 or int8 instead of 16-bit floats, which cuts memory consumption and speeds up inference; the rough arithmetic sketch below shows how much. Run locally, quantized models:

  • Respond quickly, with minimal latency
  • Keep your data private and secure
  • Work without internet connectivity and with low operational costs

That makes them perfect for developers testing ideas, privacy-conscious startups, and educators teaching students AI.
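
To see why this matters, here's a back-of-the-envelope sketch in Python. The 2B parameter count is illustrative, and real quantized files carry some extra overhead (scales and zero-points), so treat the numbers as approximations:

# Approximate weight-storage cost at different precisions.
params = 2_000_000_000          # ~2B parameters, e.g. a Gemma-2B-class model

bytes_fp16 = params * 2         # 16-bit floats: 2 bytes per weight
bytes_int8 = params * 1         # int8: 1 byte per weight
bytes_int4 = params * 0.5       # int4: half a byte per weight

for name, size in [("fp16", bytes_fp16), ("int8", bytes_int8), ("int4", bytes_int4)]:
    print(f"{name}: ~{size / 1e9:.1f} GB of weights")

# fp16: ~4.0 GB, int8: ~2.0 GB, int4: ~1.0 GB -- which is why a quantized
# 2B model fits comfortably in laptop RAM.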

Meet Docker Model Runner

Docker Model Runner is Docker's answer to the pain points of local GenAI. It lets you:

  • Run LLMs with a single command
  • Expose OpenAI-compatible APIs locally
  • Avoid Python dependency hell
  • Switch seamlessly among models

Think of it as your own personal AI runtime environment.

Prerequisites

Before you begin, you'll need:

  • Docker Desktop installed and running
  • Enough free RAM for the model you choose (the examples below cap the container at 6 GB)

Optional but helpful:

  • Familiarity with curl, Node.js, or Python

Quick Start: Running Gemma Locally


Google's Gemma 2B quantized model is a great starting point for local LLM experiments.

Step 1: Pull the Docker Image

docker pull dockermodelrunner/gemma:2b-q4

This downloads the Gemma 2B model quantized to int4 precision.

Step 2: Run the Model

docker run -d \
  -p 8080:8080 \
  --name local-gemma \
  dockermodelrunner/gemma:2b-q4

This command starts the model in the background and maps local port 8080 to the container, which exposes an OpenAI-compatible API.

Step 3: Send Your First Prompt

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Explain Docker Model Runner in 2 sentences.",
        "max_tokens": 60
      }'

You'll receive a JSON response like:

{
  "id": "cmpl-1",
  "object": "text_completion",
  "choices": [
    {
      "text": "Docker Model Runner is a tool that allows developers to run large language models locally. It simplifies deployment, improves privacy, and reduces cost."
    }
  ]
}
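
If you prefer Python over curl, the same call looks like this with the requests library. This is just a minimal sketch, assuming the Gemma container from Step 2 is listening on port 8080:

# Minimal Python client for the local completions endpoint.
import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "prompt": "Explain Docker Model Runner in 2 sentences.",
        "max_tokens": 60,
    },
    timeout=120,  # the first request can be slow while the model warms up
)
resp.raise_for_status()

# The body mirrors the OpenAI completions shape shown above.
print(resp.json()["choices"][0]["text"])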

Switching Between Models

Try another model, like Mistral:

docker run -d -p 8081:8080 dockermodelrunner/mistral:7b-q4

While Gemma keeps running on port 8080, use port 8081 to talk to the Mistral container; the two can run side by side.
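
For a quick side-by-side, here's a minimal Python sketch that sends the same prompt to each model, assuming both containers above are running on ports 8080 and 8081:

# Send one prompt to both local models and compare their answers.
import requests

PROMPT = "Summarize what a container image is in one sentence."

for name, port in [("gemma-2b", 8080), ("mistral-7b", 8081)]:
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",
        json={"prompt": PROMPT, "max_tokens": 60},
        timeout=120,
    )
    resp.raise_for_status()
    print(f"--- {name} ---")
    print(resp.json()["choices"][0]["text"].strip())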

Advanced Configuration

Set Memory/CPU Limits

docker run -d -p 8080:8080 \
  --memory="6g" --cpus="2.0" \
  dockermodelrunner/gemma:2b-q4

Send Prompts from a File

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d @prompt.json

prompt.json:
{
  "prompt": "Give me five use cases for local LLMs in education.",
  "max_tokens": 100
}

Using It in Node.js Apps

const axios = require('axios');

async function runPrompt() {
  const res = await axios.post('http://localhost:8080/v1/completions', {
    prompt: 'How does quantization reduce model size?',
    max_tokens: 80
  });

  console.log(res.data.choices[0].text);
}

runPrompt();

Benchmarking Locally

Test inference speed:

time curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Benchmark latency.", "max_tokens": 10}'

Track metrics with Docker:

docker stats local-gemma
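
For something a bit more systematic than a single timed curl, here's a small Python sketch that averages a few requests. It assumes the Gemma container is still listening on port 8080:

# Rough latency benchmark: send the same short prompt N times and report timings.
import time
import requests

N = 5
payload = {"prompt": "Benchmark latency.", "max_tokens": 10}

timings = []
for _ in range(N):
    start = time.perf_counter()
    requests.post("http://localhost:8080/v1/completions",
                  json=payload, timeout=120).raise_for_status()
    timings.append(time.perf_counter() - start)

print(f"avg: {sum(timings) / N:.2f}s  min: {min(timings):.2f}s  max: {max(timings):.2f}s")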

Data Privacy & Air-Gapped Use Cases

Because the models run entirely on your machine:

  • No prompt data ever leaves your device
  • Ideal for confidential enterprise workflows
  • Safe for academic settings where FERPA/GDPR apply

Real-World Use Cases

  • AI-powered CLI tools: Build GPT-style command-line tools with Bash, cURL, and Docker (a minimal sketch follows this list).
  • Private chatbots: Internal assistants or help desks that live behind your firewall.
  • Offline AI notebooks: Run LLMs on your laptop during flights or in remote areas.
  • Student coding projects: Teach AI hands-on without burning through cloud credits.
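
Here's what the CLI idea can look like in practice: a tiny, hypothetical ask.py script that forwards a question to the local endpoint. It assumes the Gemma container is on port 8080 and uses the requests library:

#!/usr/bin/env python3
# ask.py -- minimal GPT-style CLI backed by the local model.
import sys
import requests

def ask(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8080/v1/completions",
        json={"prompt": prompt, "max_tokens": 120},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"].strip()

if __name__ == "__main__":
    # Usage: python ask.py "How do I list running containers?"
    print(ask(" ".join(sys.argv[1:]) or "Say hello."))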

Pro Tip: Mix with LangChain

Because Docker Model Runner exposes OpenAI-compatible completions, it plugs straight into tools built for the OpenAI API, such as LangChain:

pip install langchain openai

from langchain.llms import OpenAI

# Point LangChain's OpenAI wrapper at the local endpoint; no real API key is needed.
llm = OpenAI(openai_api_base="http://localhost:8080/v1", openai_api_key="none")
print(llm("Write a haiku about containers."))

Troubleshooting

  • If Docker is not found, make sure Docker Desktop is running.
  • If you get an API error, check the model logs:
    docker logs local-gemma

  • If responses are slow, allocate more CPU/RAM in Docker Desktop settings.
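
If you're not sure whether the endpoint is reachable at all, a quick health check helps narrow things down. This is a minimal sketch assuming the default port 8080 and the container name used above:

# Quick health check for the local completions endpoint.
import requests

try:
    resp = requests.post(
        "http://localhost:8080/v1/completions",
        json={"prompt": "ping", "max_tokens": 1},
        timeout=30,
    )
    resp.raise_for_status()
    print("Endpoint is up and responding.")
except requests.RequestException as err:
    print(f"Endpoint check failed: {err}")
    print("Run `docker ps` to confirm the container is running, "
          "then `docker logs local-gemma` for details.")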

Wrapping Up

Docker Model Runner makes local LLM development accessible, fast, and fun. You can:

  • Run models like Gemma or Mistral with a single command
  • Build low-latency, secure AI applications
  • Integrate with familiar tools (Node, Python, curl)

And you get all of this without ever touching the cloud.

So go ahead and set up your own LLM workstation. If you have questions or ideas, drop them in the comments below. And if you found this helpful, share it with your Dev.to friends!
