Urvil Joshi

Docker Model Runner: Run AI Models Locally Within Your Docker Ecosystem

Docker Model Runner (DMR)

Docker Model Runner (DMR) officially reached General Availability on September 18th, transitioning from its beta phase that began in April. This powerful tool enables developers to pull, run, and manage AI models locally within the Docker ecosystem, bringing the convenience of containerization to machine learning workflows.

What is Docker Model Runner?

Docker Model Runner allows you to run Large Language Models (LLMs) directly on your local machine while leveraging Docker's robust ecosystem. It combines the best features of local AI inference with Docker's familiar tooling and workflows.

Key Features

1. Local LLM Execution

Running LLMs locally provides several critical advantages:

  • Enhanced Data Security: Your data remains entirely on your local machine, never leaving your control or being sent to external services. This is particularly important for sensitive or proprietary information.

  • Accelerated Development Workflows: Developers can iterate faster by running AI models alongside their applications without network latency or API rate limits.

  • Seamless Integration: If you're already using Docker Compose for your development environment, you can easily add AI models to your stack. When you spin up your containers, your LLM will launch simultaneously, creating a fully integrated local development environment (a minimal sketch follows this list).
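
For example, a minimal Compose sketch might look like the following. This assumes a recent Docker Compose release that supports the top-level models element; the service name, model tag, and file contents are illustrative, not a definitive recipe:

# Write a minimal compose.yaml that attaches a model to a service (illustrative sketch)
cat > compose.yaml <<'EOF'
services:
  app:
    build: .
    models:
      - llm
models:
  llm:
    model: ai/llama3.2:1b-instruct-q4_K_M
EOF

# Start the application container and the model together
docker compose up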

2. OpenAI-Compatible APIs

Docker Model Runner provides OpenAI-compatible API endpoints, making integration straightforward. Many applications already use OpenAI's API format, which means:

  • No code changes required in existing applications
  • Client applications can switch seamlessly between cloud and local models (see the example after this list)
  • Response formats remain consistent with OpenAI standards
  • Your existing parsing logic continues to work without modification
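
As a hedged illustration of that switch, many OpenAI SDK clients only need their base URL pointed at the local endpoint, for example via the SDKs' standard environment variables (the port depends on your Docker Desktop settings):

# Point an existing OpenAI-SDK-based client at the local Model Runner endpoint
export OPENAI_BASE_URL="http://localhost:PORT/v1"
# Most SDKs require a non-empty key even though the local endpoint does not check it
export OPENAI_API_KEY="not-needed-locally"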

3. Integrated Inference Engine

The architecture is designed for optimal performance:

  • Models run on your host machine rather than inside Docker containers, maximizing performance
  • Utilizes Llama.cpp inference server for efficient model execution
  • Automatic NVIDIA GPU support when available
  • Combines Docker's ecosystem management capabilities with Ollama-like performance
  • Provides Docker commands for pulling, caching, managing, and running models

4. OCI Artifact Distribution

Models are packaged and distributed as Open Container Initiative (OCI) artifacts, the same standardized format used for Docker images. This means:

  • Models can be pushed to any OCI-compatible registry
  • Standardized packaging ensures consistency and portability
  • Most models are distributed in GGUF (GPT-Generated Unified Format)

GGUF stores model weights in a compact, optionally quantized form, which reduces model size and enables AI models to run on standard hardware, including CPU-only systems. This format is ideal for local deployments where computational resources may be limited.
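
As a hedged sketch of the registry workflow, a pulled model could be re-tagged and pushed to a private OCI-compatible registry. This assumes the tag and push subcommands are available in your Docker Model Runner version (check docker model --help), and the registry host below is illustrative:

# Re-tag a local model and push it to your own OCI-compatible registry
docker model tag ai/llama3.2:1b-instruct-q4_K_M registry.example.com/team/llama3.2:1b
docker model push registry.example.com/team/llama3.2:1b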

5. Multiple Interaction Methods

Docker Model Runner offers flexibility in how you interact with models:

  • Command-line interface for terminal-based interactions
  • Docker Desktop GUI for visual model management
  • OpenAI-compatible REST APIs for programmatic access

6. Parallel Multi-Model Support

Need to run multiple models simultaneously? Docker Model Runner handles this effortlessly. For example, if you're building an AI agent that performs text summarization and image generation, you can run both models in parallel without complex configuration. Models can be accessed through the GUI, CLI, and API endpoints concurrently.
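
As a quick sketch, the following pulls two models so both can be served and queried side by side (the second model name is illustrative; check Docker Hub's ai/ namespace for available models and tags):

# Pull two models; both can then be targeted by name via the GUI, CLI, or API
docker model pull ai/llama3.2:1b-instruct-q4_K_M
docker model pull ai/mistral
docker model list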

Getting Started

Prerequisites

  • Docker Desktop version 4.41.0 or higher
  • (Optional) NVIDIA GPU for accelerated inference

Configuration

Open Docker Desktop settings and enable the following (a CLI alternative is sketched after this list):

  • GPU-backed inference: Allows automatic NVIDIA GPU detection and utilization
  • Host TCP support: Enables OpenAI-compatible API access via HTTP
  • CORS settings: Set to "all" if you encounter API access issues
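
If you prefer the command line, Docker Desktop also exposes a toggle for Model Runner. The subcommand and port below are a hedged example; verify them against the documentation for your Docker Desktop version:

# Enable Model Runner and expose the OpenAI-compatible API on a host TCP port
docker desktop enable model-runner --tcp 12434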

Finding and Pulling Models

From Docker Hub:

Navigate to the AI section in Docker Hub and search for models using the ai/ prefix. Popular models like Llama 3.2, Mistral, and Phi-3 are readily available. Each model listing shows different quantization versions, allowing you to balance performance and resource requirements.

To pull a model:

docker model pull ai/llama3.2:1b-instruct-q4_K_M

From Hugging Face:

Browse to your desired model on Hugging Face, select "Use this model," and choose "Docker Model Runner" as the deployment method. The interface will display the appropriate pull command with your selected quantization level.
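
For example, pulling directly from Hugging Face uses the hf.co/ prefix (the repository below is illustrative; the model page shows the exact command for your chosen quantization):

docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF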

Basic Commands

Check Docker Model Runner status:

docker model status

List downloaded models:

docker model list

This displays model metadata including name, parameters, quantization level, and architecture.
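
When a model is no longer needed, it can be removed to reclaim disk space (assuming the rm subcommand in your version; check docker model --help):

docker model rm ai/llama3.2:1b-instruct-q4_K_M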

Interacting with Models

Command-Line Interface

Single query:

docker model run ai/llama3.2:1b-instruct-q4_K_M "What is Docker?"

Interactive session:

docker model run ai/llama3.2:1b-instruct-q4_K_M

This opens a chat interface where you can have multi-turn conversations. Docker Model Runner maintains context across multiple exchanges. Exit by typing /bye.

Docker Desktop GUI

In the Models tab, navigate to the Local section and click "Run" next to your desired model. This launches an interactive interface where you can:

  • Chat with the model through a text input field
  • View the Inspect tab for model metadata and architecture details
  • Check the Requests tab to see your conversation history

The GUI maintains multi-turn conversation context, allowing natural, contextual interactions.

OpenAI-Compatible API

With host TCP support enabled in Docker Desktop settings, you can access models via REST API on the configured port (default varies based on your settings):

curl -X POST http://localhost:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/llama3.2:1b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "What is Docker?"}]
  }'

The response format matches OpenAI's API specification, ensuring compatibility with existing tooling and parsers.
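
Because the response shape follows the OpenAI specification, the assistant's reply can be pulled out with standard tooling such as jq; the port placeholder again depends on your settings:

# Send a chat completion request and print only the assistant's reply
curl -s http://localhost:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ai/llama3.2:1b-instruct-q4_K_M", "messages": [{"role": "user", "content": "What is Docker?"}]}' \
  | jq -r '.choices[0].message.content'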

Docker Model Runner vs. Ollama

Both tools enable local AI model execution, but they have distinct characteristics:

  • Performance: Docker Model Runner runs models on the host machine rather than in containers, typically achieving approximately 12% better performance than containerized approaches. Ollama also runs on the host, either as a standalone binary or managed service, providing similar performance benefits.

  • Integration: Docker Model Runner provides seamless integration with Docker Desktop and Docker Compose, making it ideal if you're already using Docker for development. Models can be defined in your compose files and started automatically with your application stack. Ollama operates as a standalone application with its own CLI and basic API.

  • API Endpoints: Both offer OpenAI-compatible endpoints, but they use different default ports. You can configure these as needed for your environment.

Tips and Resources

The official Docker Model Runner documentation provides comprehensive guidance for various platforms including WSL 2, Linux, and macOS. The "Known Issues" section addresses common problems and their solutions.

For those interested in the technical details, the Docker team has published an in-depth blog post covering the design philosophy, goals, GPU acceleration strategies, and high-level architecture. This resource is invaluable for understanding the engineering decisions behind Docker Model Runner.

Conclusion

Docker Model Runner represents a natural evolution for developers already invested in the Docker ecosystem. By bringing local AI model execution to Docker Desktop, it eliminates the need for separate tools while providing familiar commands and workflows.

The combination of data privacy, development speed, and seamless integration makes Docker Model Runner particularly attractive for:

  • Development teams building AI-powered applications
  • Organizations with data sensitivity requirements
  • Developers seeking faster iteration cycles
  • Teams already standardized on Docker tooling

If you're currently using Docker for development but haven't explored local AI model execution, Docker Model Runner offers a compelling entry point. Its integration with existing Docker workflows means minimal learning curve while unlocking powerful AI capabilities directly in your development environment.

Whether you're building chatbots, implementing RAG systems, or experimenting with AI agents, Docker Model Runner provides the infrastructure to do so efficiently and securely on your local machine.

Resources

Docker Model Runner Tutorial 2025: Run AI Models Locally in Minutes | Complete Guide
