Flavius Dinu

Originally published at techblog.flaviusdinu.com

Running AI Models in Docker

Do you know the easiest way to run AI models? It’s, of course, Docker.

Docker added support for running AI models in Docker Desktop 4.40.0, and it’s incredibly easy to get started. Right now, this is only available on Macs with Apple Silicon, with Windows support coming soon.

Note: I got early access to the feature, and it will soon be generally available.

How does the Model Runner work?

The Docker Model Runner doesn’t run in a container. Instead, it uses a host-installed inference server (llama.cpp) that runs locally on your machine. In the future, Docker will support additional inference servers.

To break down how this works:

  1. A host-level process provides direct access to hardware GPU acceleration.
  2. GPU acceleration is enabled via Apple’s Metal API during query processing.
  3. Models are cached locally in your host machine’s storage and are dynamically loaded into memory by llama.cpp when needed. This means your data never leaves your infrastructure.

Models are stored as OCI artifacts in Docker Hub, ensuring compatibility with other registries, including internal ones. This approach enables faster deployments, reduces disk usage, and improves UX by avoiding unnecessary compression.

How can you use the Model Runner?

To get started, open Docker Desktop, select Features in development under Settings, and make sure Enable Docker Model Runner is toggled on.


After you enable the feature, make sure you restart Docker Desktop.
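
If you prefer the command line, Docker Desktop also exposes a CLI toggle for this feature. The exact command below is an assumption based on the Docker Desktop CLI at the time of writing, so double-check it against your version:

# Enable the Model Runner feature from the terminal (assumed command)
docker desktop enable model-runner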

To check whether the Model Runner is up, you can simply run:

docker model status
Docker Model Runner is running

If you want to see which models are available locally:

docker model list
{"object":"list","data":[]}

Of course, since you haven’t pulled any models yet, the list will be empty, as it is in my case.

Right now, you can find the list of available models in the ai namespace on Docker Hub.

To pull a model, you can simply run:

docker model pull model_name

For this example, I will use ai/llama3.2.

docker model pull ai/llama3.2
Model ai/llama3.2 pulled successfully

We are now ready to run the model. You can do it either interactively or as a one-off command.

docker model run ai/llama3.2
Interactive chat mode started. Type '/bye' to exit.
> Hey how are you?
I'm just a language model, so I don't have feelings or emotions like
humans do, but I'm functioning properly and ready to help you
with any questions or tasks you may have!
How about you? How's your day going?

docker model run ai/llama3.2 "Hello"
Hello! How are you today? Is there something I can help you with, or would you like to chat?

Well, this is cool, but you want to take it to the next level and use it in your apps, right?

You can connect to the model in three ways:

  1. From within a container, using the internal DNS name: http://model-runner.docker.internal/
  2. From the host, using the Docker socket
  3. From the host, using TCP (see the example below)
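
For example, to connect from the host over TCP, you first need to expose the Model Runner on a local port. The command and port below are what Docker documents at the time of writing, but treat them as assumptions that may change while the feature is still in development:

# Expose the Model Runner over TCP on port 12434 (assumed default port)
docker desktop enable model-runner --tcp 12434

# The API is then reachable on localhost
curl http://localhost:12434/engines/llama.cpp/v1/models

From within a container, the same API is available at http://model-runner.docker.internal/ without any extra setup.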

Another great thing is that the API offers OpenAI-compatible endpoints:

GET /engines/{backend}/v1/models
GET /engines/{backend}/v1/models/{namespace}/{name}
POST /engines/{backend}/v1/chat/completions
POST /engines/{backend}/v1/completions
POST /engines/{backend}/v1/embeddings
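
To put these endpoints to work, here is a minimal sketch of a chat completion request sent from inside a container, using llama.cpp as the {backend} value and the ai/llama3.2 model pulled earlier; the request body follows the standard OpenAI chat schema:

curl http://model-runner.docker.internal/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ai/llama3.2",
        "messages": [
          {"role": "user", "content": "Write a haiku about containers."}
        ]
      }'

If you are calling from the host over TCP instead, swap the base URL for http://localhost:12434 as shown above.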

Wrap-up

Docker Model Runner makes running AI models ridiculously simple. With just a few commands, you can pull, run, and integrate models into your applications — all while keeping everything local and secure.

This is just the beginning — as Docker expands support to Windows and additional inference servers, the experience will only get better.

Stay tuned, and keep building 🐳
