Aditya Gupta
Running Quantized LLMs Locally: Unlocking Docker Model Runner's Potential

Have you ever dreamed of running powerful Large Language Models (LLMs) on your own machine, free from cloud service complexity and major infrastructure costs? Docker Model Runner is an innovative solution for running quantized LLMs locally. Get set to leverage AI capabilities directly from your desktop, with no GPU cluster required and no billing notifications to worry about.

Why Local Quantized LLMs?

As generative AI reshapes every layer of the technology stack, developers, startups, researchers, and hobbyists are wiring AI into internal tools and workflows, not just agents and chatbots. However, most cloud-based offerings:

  • Bill by token or minute, which quickly adds up
  • Require API keys and impose rate limits
  • Send confidential data to outside service providers


Quantized local LLMs sidestep these challenges. Quantization compresses a model by storing its weights at lower precision, for example int4 or int8 instead of 16-bit floats, which cuts memory consumption and speeds up inference; the rough arithmetic sketch below shows how much. Run locally, quantized models:

  • Respond quickly, with minimal latency
  • Keep your data private and secure
  • Work without internet connectivity and with low operational costs

That makes them perfect for developers testing ideas, privacy-conscious startups, and educators teaching students AI.
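
To see why this matters, here's a back-of-the-envelope sketch in Python. The 2B parameter count is illustrative, and real quantized files carry some extra overhead (scales and zero-points), so treat the numbers as approximations:

# Approximate weight-storage cost at different precisions.
params = 2_000_000_000          # ~2B parameters, e.g. a Gemma-2B-class model

bytes_fp16 = params * 2         # 16-bit floats: 2 bytes per weight
bytes_int8 = params * 1         # int8: 1 byte per weight
bytes_int4 = params * 0.5       # int4: half a byte per weight

for name, size in [("fp16", bytes_fp16), ("int8", bytes_int8), ("int4", bytes_int4)]:
    print(f"{name}: ~{size / 1e9:.1f} GB of weights")

# fp16: ~4.0 GB, int8: ~2.0 GB, int4: ~1.0 GB -- which is why a quantized
# 2B model fits comfortably in laptop RAM.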

Meet Docker Model Runner

Docker Model Runner is Docker's answer to the pain points of local GenAI. It lets you:

  • Run LLMs with a single command
  • Expose OpenAI-compatible APIs locally
  • Avoid Python dependency hell
  • Switch seamlessly among models

Think of it as your own personal AI runtime environment.

Prerequisites

Before you begin, you'll need:

  • Docker Desktop installed and running
  • Enough free RAM for the model you choose (the examples below cap the container at 6 GB)

Optional but helpful:

  • Familiarity with curl, Node.js, or Python

Quick Start: Running Gemma Locally


Google's Gemma 2B quantized model is a great starting point for local LLM experiments.

Step 1: Pull the Docker Image

docker pull dockermodelrunner/gemma:2b-q4

This downloads the Gemma 2B model quantized to int4 precision.

Step 2: Run the Model

docker run -d \
  -p 8080:8080 \
  --name local-gemma \
  dockermodelrunner/gemma:2b-q4

This command starts the model in the background and maps local port 8080 to the container, which exposes an OpenAI-compatible API.

Step 3: Send Your First Prompt

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Explain Docker Model Runner in 2 sentences.",
        "max_tokens": 60
      }'

You'll receive a JSON response like:

{
  "id": "cmpl-1",
  "object": "text_completion",
  "choices": [
    {
      "text": "Docker Model Runner is a tool that allows developers to run large language models locally. It simplifies deployment, improves privacy, and reduces cost."
    }
  ]
}
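
If you prefer Python over curl, the same call looks like this with the requests library. This is just a minimal sketch, assuming the Gemma container from Step 2 is listening on port 8080:

# Minimal Python client for the local completions endpoint.
import requests

resp = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "prompt": "Explain Docker Model Runner in 2 sentences.",
        "max_tokens": 60,
    },
    timeout=120,  # the first request can be slow while the model warms up
)
resp.raise_for_status()

# The body mirrors the OpenAI completions shape shown above.
print(resp.json()["choices"][0]["text"])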

Switching Between Models

Try another model, like Mistral:

docker run -d -p 8081:8080 dockermodelrunner/mistral:7b-q4

While Gemma keeps running on port 8080, use port 8081 to talk to the Mistral container; the two can run side by side.
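
For a quick side-by-side, here's a minimal Python sketch that sends the same prompt to each model, assuming both containers above are running on ports 8080 and 8081:

# Send one prompt to both local models and compare their answers.
import requests

PROMPT = "Summarize what a container image is in one sentence."

for name, port in [("gemma-2b", 8080), ("mistral-7b", 8081)]:
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",
        json={"prompt": PROMPT, "max_tokens": 60},
        timeout=120,
    )
    resp.raise_for_status()
    print(f"--- {name} ---")
    print(resp.json()["choices"][0]["text"].strip())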

Advanced Configuration

Set Memory/CPU Limits

docker run -d -p 8080:8080 \
  --memory="6g" --cpus="2.0" \
  dockermodelrunner/gemma:2b-q4

Send Prompts from a File

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d @prompt.json

prompt.json:
{
  "prompt": "Give me five use cases for local LLMs in education.",
  "max_tokens": 100
}

Using It in Node.js Apps

const axios = require('axios');

async function runPrompt() {
  const res = await axios.post('http://localhost:8080/v1/completions', {
    prompt: 'How does quantization reduce model size?',
    max_tokens: 80
  });

  console.log(res.data.choices[0].text);
}

runPrompt();

Benchmarking Locally

Test inference speed:

time curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Benchmark latency.", "max_tokens": 10}'

Track metrics with Docker:

docker stats local-gemma
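
For something a bit more systematic than a single timed curl, here's a small Python sketch that averages a few requests. It assumes the Gemma container is still listening on port 8080:

# Rough latency benchmark: send the same short prompt N times and report timings.
import time
import requests

N = 5
payload = {"prompt": "Benchmark latency.", "max_tokens": 10}

timings = []
for _ in range(N):
    start = time.perf_counter()
    requests.post("http://localhost:8080/v1/completions",
                  json=payload, timeout=120).raise_for_status()
    timings.append(time.perf_counter() - start)

print(f"avg: {sum(timings) / N:.2f}s  min: {min(timings):.2f}s  max: {max(timings):.2f}s")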

Data Privacy & Air-Gapped Use Cases

Because the models run entirely on your machine:

  • No prompt data ever leaves your device
  • Ideal for confidential enterprise workflows
  • Safe for academic settings where FERPA/GDPR apply

Real-World Use Cases

  • AI-powered CLI tools: Build GPT-style command-line tools with Bash, cURL, and Docker (a minimal sketch follows this list).
  • Private chatbots: Internal assistants or help desks that live behind your firewall.
  • Offline AI notebooks: Run LLMs on your laptop during flights or in remote areas.
  • Student coding projects: Teach AI hands-on without burning through cloud credits.
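
Here's what the CLI idea can look like in practice: a tiny, hypothetical ask.py script that forwards a question to the local endpoint. It assumes the Gemma container is on port 8080 and uses the requests library:

#!/usr/bin/env python3
# ask.py -- minimal GPT-style CLI backed by the local model.
import sys
import requests

def ask(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8080/v1/completions",
        json={"prompt": prompt, "max_tokens": 120},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"].strip()

if __name__ == "__main__":
    # Usage: python ask.py "How do I list running containers?"
    print(ask(" ".join(sys.argv[1:]) or "Say hello."))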

Pro Tip: Mix with LangChain

Because Docker Model Runner exposes OpenAI-compatible completions, it plugs straight into tools built for the OpenAI API, such as LangChain:

pip install langchain openai

from langchain.llms import OpenAI

# Point LangChain's OpenAI wrapper at the local endpoint; no real API key is needed.
llm = OpenAI(openai_api_base="http://localhost:8080/v1", openai_api_key="none")
print(llm("Write a haiku about containers."))

Troubleshooting

  • If Docker is not found, make sure Docker Desktop is running.
  • If you get an API error, check the model logs:
    docker logs local-gemma

  • If responses are slow, allocate more CPU/RAM in Docker Desktop settings.
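
If you're not sure whether the endpoint is reachable at all, a quick health check helps narrow things down. This is a minimal sketch assuming the default port 8080 and the container name used above:

# Quick health check for the local completions endpoint.
import requests

try:
    resp = requests.post(
        "http://localhost:8080/v1/completions",
        json={"prompt": "ping", "max_tokens": 1},
        timeout=30,
    )
    resp.raise_for_status()
    print("Endpoint is up and responding.")
except requests.RequestException as err:
    print(f"Endpoint check failed: {err}")
    print("Run `docker ps` to confirm the container is running, "
          "then `docker logs local-gemma` for details.")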

Wrapping Up

Docker Model Runner makes local LLM development accessible, fast, and fun. You can:

  • Run models like Gemma or Mistral with a single command
  • Build low-latency, secure AI applications
  • Integrate with familiar tools (Node, Python, curl)

And you get all of this without ever touching the cloud.

So go ahead and set up your own LLM workstation. If you have questions or ideas, drop them in the comments below. And if you found this helpful, share it with your Dev.to friends!
