Shivam
How to Run an LLM Locally Using Docker

Self-hosted LLMs are gaining a lot of momentum. They offer advantages such as improved performance, lower costs, and better data privacy. You don't need to rely on third-party APIs, which means no unexpected increases in latency, no sudden changes in model behavior, and more control over your LLM.

However, running a model locally is a task in itself. There is currently no standard way or tool available to run models on local machines.

Docker Model Runner:

Docker Model Runner makes running AI models as simple as running a container locally — just a single command, with no need for additional configuration or hassle.

If you are using Apple Silicon, you can take advantage of GPU acceleration for faster inference.

Docker's local LLM inference engine is built on top of llama.cpp and is exposed through an OpenAI-compatible API. Before running a model with Docker Model Runner, make sure you are using Docker Desktop version 4.40 or later.
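
Before pulling anything, you can confirm that Model Runner is enabled with its status subcommand (a quick sanity check; the exact output varies by version):

docker model status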

Running a model is similar to running a container. Start by pulling a model:

docker model pull ai/llama3.1

The full list of available models is on Docker Hub.
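
To see which models are already pulled to your machine, you can list them locally:

docker model list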

Once the pull is complete, your model is ready to use. You don't need to start any containers manually; Docker Model Runner exposes an inference API server that handles your requests.
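
For a quick one-off test, you can also chat with the model straight from the CLI; this goes through the same local inference server (pass a prompt as an argument, or omit it for an interactive session):

docker model run ai/llama3.1 "Explain llama.cpp in one sentence."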

You can access your model from other containers using the http://model-runner.docker.internal/engines/v1 endpoint:

curl http://model-runner.docker.internal/engines/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ai/llama3.1",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Please write 500 words about the fall of Rome."
            }
        ]
    }'

If you want to access the model from host processes (i.e., the machine where Docker is running), you need to enable TCP host access:

docker desktop enable model-runner --tcp 12434

Here, 12434 is the TCP port where your model will be accessible.
You can then make requests from the host like this:

curl http://localhost:12434/engines/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ai/llama3.1",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Please write 500 words about the fall of Rome."
            }
        ]
    }'
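Because the API is OpenAI-compatible, the standard model-listing endpoint should work too, which makes for a handy connectivity check before wiring up a client (this assumes the same /engines/v1 prefix used above):

curl http://localhost:12434/engines/v1/models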

With everything up and running, you're ready to make local LLM calls!
You can use this endpoint with any OpenAI-compatible client or framework.
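
For example, the official OpenAI SDKs read their base URL and API key from environment variables, so you can often point an existing tool at Model Runner without changing any code. Model Runner doesn't validate the key, so any placeholder works; the variable names below are standard OpenAI SDK conventions, not Docker-specific settings:

export OPENAI_BASE_URL=http://localhost:12434/engines/v1
export OPENAI_API_KEY=anything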
