<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oleg Šelajev</title>
    <description>The latest articles on DEV Community by Oleg Šelajev (@olegshelajev).</description>
    <link>https://dev.to/olegshelajev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F771739%2F3d6232f7-be47-4e5d-9a3c-4a2134440b3e.jpeg</url>
      <title>DEV Community: Oleg Šelajev</title>
      <link>https://dev.to/olegshelajev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/olegshelajev"/>
    <language>en</language>
    <item>
      <title>Running LLMs with Docker on Linux: from local to CI</title>
      <dc:creator>Oleg Šelajev</dc:creator>
      <pubDate>Fri, 20 Jun 2025 09:38:43 +0000</pubDate>
      <link>https://dev.to/olegshelajev/running-llms-with-docker-on-linux-from-local-to-ci-29fe</link>
      <guid>https://dev.to/olegshelajev/running-llms-with-docker-on-linux-from-local-to-ci-29fe</guid>
      <description>&lt;p&gt;Earlier this year, Docker released Docker Model Runner, a component integrated into Docker Desktop that allows you to run Large Language Models (LLMs) locally on your machine. Unlike typical container-based execution, Docker Model Runner can leverage the full capabilities of your GPU hardware directly, offering optimal performance. Initially available on macOS and Windows through Docker Desktop, Docker Model Runner is now also available as part of Docker Community Edition (CE). This expansion means you can integrate it seamlessly into your Linux-based continuous integration (CI) pipelines or even use it directly in production.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore how to install Docker Model Runner on a Linux VM with Docker already available. We'll go through pulling some LLMs, running them, and clarifying which URLs you'll use to connect to your models from applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting a Linux VM
&lt;/h2&gt;

&lt;p&gt;First, we need a Linux VM. To keep things simple, we’ll use Google Cloud Platform’s Shell console, which provides a convenient Linux VM environment right in your browser without needing to provision custom resources.&lt;/p&gt;

&lt;p&gt;The VM provided through Cloud Shell isn’t particularly powerful, but it has Docker pre-installed, making it ideal for our demonstration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1898ijpxvaatpjigmnsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1898ijpxvaatpjigmnsw.png" alt="Cloud shell" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To launch it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your &lt;a href="https://console.cloud.google.com" rel="noopener noreferrer"&gt;Google Cloud Platform Console&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Click on the Shell icon at the top-right corner.&lt;/li&gt;
&lt;li&gt;Authorize the browser if prompted.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Verify Docker installation by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker &lt;span class="nt"&gt;--version&lt;/span&gt;
Docker version 28.2.2, build e6534b4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing Docker Model Runner
&lt;/h2&gt;

&lt;p&gt;Docker Model Runner on Linux uses standard Docker primitives like containers and volumes to manage GPU passthrough and LLM lifecycle efficiently.&lt;/p&gt;

&lt;p&gt;First, install the required plugin package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;docker-model-plugin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu803c0yt9hoj0l0hekgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu803c0yt9hoj0l0hekgc.png" alt="installing docker-model-plugin" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After installation, you can confirm everything is set up correctly by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker models list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command initially pulls necessary infrastructure components. Once complete, it will display any models available locally. Since we haven't downloaded any yet, it'll show an empty list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvx32vuycudfm483ynn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvx32vuycudfm483ynn4.png" alt="Docker model runner works on Linux too" width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pulling and Running an LLM
&lt;/h2&gt;

&lt;p&gt;Next, let's install a small, resource-efficient model suitable for the Cloud Shell VM. You can choose a model from the Docker AI Hub at &lt;a href="https://hub.docker.com/u/ai" rel="noopener noreferrer"&gt;hub.docker.com/u/ai&lt;/a&gt;. For this demonstration, we'll use a small Qwen model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker model pull ai/qwen3:0.6B-Q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once pulled, verify it by running the model interactively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker model run ai/qwen3:0.6B-Q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can ask questions and get the typical LLM answers: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fniom5wiwqdmb20a8b27x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fniom5wiwqdmb20a8b27x.png" alt="Image description" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting to the Model
&lt;/h2&gt;

&lt;p&gt;Docker Model Runner hosts an inference server that you can connect to using a standard HTTP endpoint. Internally, from Docker containers, the server is accessible via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://172.17.0.1:12434/engines/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Externally, from your local machine or other environments, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:12434
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, to query your model via an OpenAI-compatible API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:12434/engines/v1/chat/completions &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
  "model": "ai/qwen3:0.6B-Q4_K_M",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please write 100 words about the fall of Rome."}
  ]
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response will be JSON-formatted and include the completion text provided by the model, something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"finish_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stop"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The fall of Rome marked the end of the Roman Empire, which had long been a dominant power in the Mediterranean. The decline was driven by internal struggles, including political instability and weakened central authority, alongside external pressures and shifting alliances. The collapse of the Empire had profound effects on Europe, shaping the course of medieval civilization. As Rome faded, its legacy endured through art, religion, and the enduring influence of its legacy on Western culture."&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1750411314&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ai/qwen3:0.6B-Q4_K_M"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"completion_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;252&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"prompt_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;284&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
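&lt;p&gt;From application code, the same endpoint works with any OpenAI-compatible client. Here's a minimal Python sketch using only the standard library; it assumes the model pulled above is available and Docker Model Runner is listening on the host at port 12434 (the helper names are ours, not part of any API):&lt;/p&gt;

```python
import json
from urllib import request

# Host-side base URL of Docker Model Runner's OpenAI-compatible API.
BASE_URL = "http://localhost:12434/engines/v1"
MODEL = "ai/qwen3:0.6B-Q4_K_M"

def build_chat_request(prompt, model=MODEL):
    """Build the JSON body for a chat completion call (OpenAI format)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }

def ask(prompt):
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = request.Request(
        BASE_URL + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires a running Model Runner):
# print(ask("Please write 100 words about the fall of Rome."))
```

&lt;p&gt;From inside a container on the same machine, you would swap the base URL for the internal &lt;code&gt;http://172.17.0.1:12434/engines/v1&lt;/code&gt; address mentioned earlier.&lt;/p&gt;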



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With &lt;a href="https://docs.docker.com/ai/model-runner/" rel="noopener noreferrer"&gt;Docker Model Runner&lt;/a&gt;, you can now easily run powerful LLMs locally on Windows, macOS, and Linux, significantly simplifying integration into your development workflows. Whether you're using it for local experimentation or in CI environments Docker Model Runner provides a straightforward solution to add AI to your applications without breaking a sweat.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>ai</category>
      <category>howto</category>
    </item>
    <item>
      <title>Implementing MCP Servers in Java: Stockfish example</title>
      <dc:creator>Oleg Šelajev</dc:creator>
      <pubDate>Thu, 29 May 2025 10:24:54 +0000</pubDate>
      <link>https://dev.to/olegshelajev/implementing-mcp-servers-in-java-stockfish-example-4dme</link>
      <guid>https://dev.to/olegshelajev/implementing-mcp-servers-in-java-stockfish-example-4dme</guid>
      <description>&lt;p&gt;Among other things, Model Context Protocol (MCP) enables AI models to interact with external tools and services through a structured interface. Which allows models to defer control to actual software libraries and execute tasks with reproducibility, predictable performance, and security guarantees.&lt;/p&gt;

&lt;p&gt;This blog post demonstrates creating an MCP server in Java that integrates the open-source chess engine Stockfish. We will use this MCP server to equip AI models with the state-of-the-art ability to analyze chess positions and moves.&lt;/p&gt;

&lt;p&gt;We chose Java because it's an enterprise-standard language widely adopted for large-scale applications. Its ecosystem is robust, mature, and continues to power thousands of enterprise solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stockfish Setup
&lt;/h2&gt;

&lt;p&gt;&lt;a href="//stockfishchess.org/"&gt;Stockfish&lt;/a&gt; is a highly popular open source chess engine.&lt;/p&gt;

&lt;p&gt;The MCP server implementation will consist of a Docker image with the Stockfish binary for chess analysis and a Quarkus application implementing the MCP protocol around the Stockfish binary.&lt;/p&gt;

&lt;p&gt;You can check out the complete project on GitHub: &lt;a href="https://github.com/shelajev/mcp-stockfish/tree/main" rel="noopener noreferrer"&gt;https://github.com/shelajev/mcp-stockfish/tree/main&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To implement the MCP server functionality with Quarkus, we need the &lt;a href="https://quarkus.io/extensions/io.quarkiverse.mcp/quarkus-mcp-server-sse/" rel="noopener noreferrer"&gt;quarkus-mcp-server-sse&lt;/a&gt; dependency, which you can install just like any other Quarkus extension using a Maven command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./mvnw quarkus:add-extension &lt;span class="nt"&gt;-Dextensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"io.quarkiverse.mcp:quarkus-mcp-server-sse"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will essentially add the necessary dependency section to the &lt;code&gt;pom.xml&lt;/code&gt;, and in the case of more complex extensions it can also add build plugins and change the configuration so that everything works out of the box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.quarkiverse.mcp&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;quarkus-mcp-server-sse&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;1.2.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool implementation, marked with a &lt;code&gt;@Tool&lt;/code&gt; annotation, will be automatically picked up by Quarkus and registered to be announced when the MCP requests start coming in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Singleton&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyTools&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nd"&gt;@Tool&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Analyze a chess position using Stockfish."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="nc"&gt;ToolResponse&lt;/span&gt; &lt;span class="nf"&gt;stockfish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@ToolArg&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"FEN of the chess position"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;fen&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;timeoutSeconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""
      expect -c "spawn stockfish; \
      send \"uci\r\"; \
      send \"setoption name MultiPV value 2\r\"; \
      send \"position fen %s\r\"; \
      send \"go depth %d\r\"; \
      sleep %d; \
      send \"quit\r\"; interact"
      """&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;formatted&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fen&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ToolResponse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the very convenient &lt;code&gt;$(command).get()&lt;/code&gt;, which uses &lt;a href="https://github.com/jbangdev/jbang-jash" rel="noopener noreferrer"&gt;Jash&lt;/a&gt;, a Java library that provides a fluent interface to &lt;code&gt;Process&lt;/code&gt;.&lt;br&gt;
It really is a nice API for running shell commands from Java. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;expect&lt;/code&gt; utility handles the interaction with the Stockfish binary. Unlike typical CLI applications, where a single invocation receives all its parameters and represents a complete unit of computation, the Stockfish CLI starts up and then waits for you to send it commands, both for configuration and for chess analysis. That's more convenient when you need to analyze sequences of moves at once, but a bit awkward to integrate against as a CLI. &lt;/p&gt;
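&lt;p&gt;To make the UCI flow concrete, here is the command sequence the &lt;code&gt;expect&lt;/code&gt; script drives, sketched as plain Python. This is an illustration only, not the article's actual implementation, and the helper name is hypothetical:&lt;/p&gt;

```python
def uci_session(fen, depth=15):
    """Return the UCI commands sent to Stockfish for one analysis,
    mirroring the expect script: handshake, options, position, search."""
    return [
        "uci",                             # start the UCI handshake
        "setoption name MultiPV value 2",  # report the two best lines
        "position fen {}".format(fen),     # load the position to analyze
        "go depth {}".format(depth),       # search to the given depth
        "quit",                            # shut the engine down
    ]

commands = uci_session("7K/8/k1P5/7p/8/8/8/8 w - - 0 1")
```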

&lt;p&gt;Anyway, we package our MCP server as a Docker container using this Dockerfile: &lt;a href="https://github.com/shelajev/mcp-stockfish/blob/main/src/main/docker/Dockerfile.jvm" rel="noopener noreferrer"&gt;https://github.com/shelajev/mcp-stockfish/blob/main/src/main/docker/Dockerfile.jvm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This Dockerfile uses a multi-stage build to compile the Stockfish binary, followed by the standard Quarkus boilerplate instructions to package the app. &lt;/p&gt;

&lt;p&gt;And of course we copy the Stockfish binary to the final image too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COPY --from=stockfish_builder /opt/Stockfish/src/stockfish /usr/local/bin/stockfish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building and Running the Server
&lt;/h3&gt;

&lt;p&gt;Build the Quarkus application using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./mvnw verify &lt;span class="nt"&gt;-Dquarkus&lt;/span&gt;.container-image.build&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Docker image will be tagged as &lt;code&gt;shelajev/mcp-stockfish:0.0.1&lt;/code&gt;. Run the server using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 shelajev/mcp-stockfish:0.0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing our glorious MCP
&lt;/h2&gt;

&lt;p&gt;Verify the server works using the MCP inspector by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @modelcontextprotocol/inspector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then connect it to &lt;code&gt;http://localhost:8080/mcp&lt;/code&gt;, where the Quarkus MCP Server extension hosts the endpoint (you can override the config, but for our use the defaults suffice): &lt;/p&gt;

&lt;p&gt;Click on the Tools tab, then the List Tools button. Then manually call the MCP server to verify its functionality. &lt;/p&gt;
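&lt;p&gt;Under the hood, calling a tool from the inspector sends a JSON-RPC message over the MCP transport. Here's a hedged Python sketch of what that request body looks like for our tool (the initialization handshake that precedes it is omitted, and the helper name is ours):&lt;/p&gt;

```python
import json

def tools_call_request(fen, request_id=1):
    """Build an MCP tools/call JSON-RPC request for the stockfish tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "stockfish",        # tool name from the @Tool method
            "arguments": {"fen": fen},  # matches the @ToolArg parameter
        },
    }

body = json.dumps(tools_call_request("7K/8/k1P5/7p/8/8/8/8 w - - 0 1"))
```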

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbs5t4d7wdmktv3cxjve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbs5t4d7wdmktv3cxjve.png" alt="Reti study" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider this &lt;a href="https://en.wikipedia.org/wiki/R%C3%A9ti_endgame_study" rel="noopener noreferrer"&gt;famous endgame study by Richard Réti&lt;/a&gt;. Its FEN, the notation describing a chess position, which we'll be passing to the MCP server and Stockfish, is &lt;code&gt;7K/8/k1P5/7p/8/8/8/8 w - - 0 1&lt;/code&gt;. &lt;/p&gt;
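&lt;p&gt;If the FEN string looks opaque, a few lines of Python show its structure: six space-separated fields covering piece placement, side to move, castling rights, the en passant square, and the move clocks (the function is just an illustration):&lt;/p&gt;

```python
def parse_fen(fen):
    """Split a FEN string into its six standard fields."""
    placement, side, castling, en_passant, halfmove, fullmove = fen.split()
    return {
        "placement": placement,      # ranks 8..1, '/'-separated
        "side_to_move": side,        # 'w' or 'b'
        "castling": castling,        # '-' means no castling rights
        "en_passant": en_passant,    # '-' means no en passant square
        "halfmove_clock": int(halfmove),
        "fullmove_number": int(fullmove),
    }

reti = parse_fen("7K/8/k1P5/7p/8/8/8/8 w - - 0 1")
# reti["side_to_move"] is "w": it is White to move in the study.
```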

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjcetrh8d9nw1bl4ap4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjcetrh8d9nw1bl4ap4r.png" alt="Stockfish MCP working" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see in the output that Stockfish analyzes the position and correctly suggests &lt;code&gt;Kg7&lt;/code&gt; as the optimal move, leading to a surprising draw.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration with AI assistants
&lt;/h2&gt;

&lt;p&gt;You can of course integrate this MCP server into any AI assistant that supports the protocol. As an exercise for the reader, try to configure &lt;a href="https://dev.to/olegshelajev/easy-private-ai-assistant-with-goose-and-docker-model-runner-4b41"&gt;Goose, which we set up in a previous article&lt;/a&gt; to use our Stockfish MCP.&lt;/p&gt;

&lt;p&gt;We'll integrate it into VS Code: &lt;/p&gt;

&lt;p&gt;In the workspace, create the &lt;code&gt;mcp.json&lt;/code&gt; file and configure the server like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "servers": {
    "stockfish": {
        "name": "Stockfish",
        "url": "http://localhost:8080/mcp",
        "type": "http"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And click the start button on the server definition so VS Code will connect to it: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklbmpzadhxbfz39eetc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklbmpzadhxbfz39eetc9.png" alt="VS Code mcp.json" width="800" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now all that is left is to use the chat in Agent mode, so it has access to tools, and ask it to analyze chess positions: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpusgwqf2zghjtu2k415v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpusgwqf2zghjtu2k415v.png" alt="Stockfish MCP in VS Code" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Creating an MCP server in Java is super straightforward, and it's a great way to enhance AI assistants with predictable, repeatable functionality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.docker.com/blog/whats-next-for-mcp-security/" rel="noopener noreferrer"&gt;Running MCP servers in Docker adds isolation, reproducibility, and security benefits&lt;/a&gt;, so in general one should probably prefer that over running naked npx servers and other similar approaches. &lt;/p&gt;

&lt;p&gt;But also you don't have to roll MCP servers yourself for most standard APIs you want to integrate with. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fy9cea2cies65rbt46h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fy9cea2cies65rbt46h.png" alt="Docker MCP Toolkit" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For typical integrations (Slack, Notion, GitHub), there's Docker’s MCP toolkit, which makes working with MCP even better by &lt;a href="https://www.docker.com/blog/announcing-docker-mcp-catalog-and-toolkit-beta/" rel="noopener noreferrer"&gt;providing discovery, simplified installation, secrets management, access control, etc.&lt;/a&gt; -- all the good things you want if you're building production-grade systems. &lt;/p&gt;

</description>
      <category>java</category>
      <category>docker</category>
      <category>ai</category>
      <category>mcp</category>
    </item>
    <item>
      <title>AI-Enhanced Mock APIs with Docker Model Runner and Microcks</title>
      <dc:creator>Oleg Šelajev</dc:creator>
      <pubDate>Mon, 26 May 2025 14:01:06 +0000</pubDate>
      <link>https://dev.to/olegshelajev/ai-enhanced-mock-apis-with-docker-model-runner-and-microcks-4hja</link>
      <guid>https://dev.to/olegshelajev/ai-enhanced-mock-apis-with-docker-model-runner-and-microcks-4hja</guid>
      <description>&lt;p&gt;&lt;a href="https://microcks.io/" rel="noopener noreferrer"&gt;Microcks is a powerful CNCF tool&lt;/a&gt; that allows developers to quickly spin up mock services for development and testing. By providing predefined mock responses or generating them directly from an OpenAPI schema, you can point your applications to consume these mocks instead of hitting real APIs, enabling efficient and safe testing environments.&lt;/p&gt;

&lt;p&gt;Docker Model Runner is a convenient way to run LLMs locally within your Docker Desktop. It provides an OpenAI-compatible API, allowing you to integrate sophisticated AI capabilities into your projects seamlessly, using local hardware resources.&lt;/p&gt;

&lt;p&gt;By integrating Microcks with Docker Model Runner, you can enrich your mock APIs with AI-generated responses, creating realistic and varied data that is less rigid than static examples. &lt;/p&gt;

&lt;p&gt;In this guide, we'll explore how to set up these two tools together, giving you the benefits of dynamic mock generation powered by local AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Docker Model Runner
&lt;/h2&gt;

&lt;p&gt;To start, ensure you've enabled Docker Model Runner as described in our previous guide on configuring Goose for a local AI assistant setup: &lt;a href="https://dev.to/olegshelajev/easy-private-ai-assistant-with-goose-and-docker-model-runner-4b41"&gt;Easy Private AI Assistant with Goose and Docker Model Runner&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next, select and pull your desired LLM model from Docker Hub. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker model pull ai/qwen3:8B-F16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuring Microcks with Docker Model Runner
&lt;/h2&gt;

&lt;p&gt;First, clone the Microcks repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/microcks/microcks &lt;span class="nt"&gt;--depth&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Navigate to the Docker Compose setup directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;microcks/install/docker-compose
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll need to adjust some configurations to enable the AI Copilot feature within Microcks.&lt;br&gt;
In the &lt;code&gt;config/application.properties&lt;/code&gt; file, configure the AI Copilot to use Docker Model Runner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ai-copilot.enabled=true
ai-copilot.implementation=openai
ai-copilot.openai.api-key=irrelevant
ai-copilot.openai.api-url=http://model-runner.docker.internal:80/engines/llama.cpp/
ai-copilot.openai.timeout=600
ai-copilot.openai.maxTokens=10000
ai-copilot.openai.model=ai/qwen3:8B-F16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're using &lt;code&gt;model-runner.docker.internal:80&lt;/code&gt; as the base URL for the OpenAI-compatible API. Docker Model Runner is reachable at that address from containers running in Docker Desktop, so the containers communicate with the model runner directly instead of taking a detour through the host machine's ports.&lt;/p&gt;
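&lt;p&gt;If you want to sanity-check that this endpoint is reachable from inside a container before starting Microcks, one quick way is to list the models the runner knows about. This is a sketch: it assumes the &lt;code&gt;curlimages/curl&lt;/code&gt; image and that your Docker Model Runner version serves the OpenAI-compatible &lt;code&gt;/v1/models&lt;/code&gt; route under this engine path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run --rm curlimages/curl -s http://model-runner.docker.internal:80/engines/llama.cpp/v1/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A JSON response listing your pulled models confirms the containers can reach the model runner.&lt;/p&gt;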

&lt;p&gt;Next, enable the copilot feature itself by adding this line to the Microcks &lt;code&gt;config/features.properties&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;features.feature.ai-copilot.enabled=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running Microcks
&lt;/h2&gt;

&lt;p&gt;Start Microcks with Docker Compose in development mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose-devmode.yml up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once up, access the Microcks UI at &lt;a href="http://localhost:8080/" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Install the example API for testing. Click through these buttons on the Microcks page:&lt;br&gt;
&lt;strong&gt;Microcks Hub → MicrocksIO Samples APIs → pastry-api-openapi v.2.0.0 → Install → Direct import → Go&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tu8mm6fuihc9gmlvc44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tu8mm6fuihc9gmlvc44.png" alt="Pastry API service in Microcks" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Using AI Copilot Samples
&lt;/h3&gt;

&lt;p&gt;Within the Microcks UI, &lt;a href="http://localhost:8080/#/services/683467ce41829c4e6acbcdf3" rel="noopener noreferrer"&gt;navigate to the service page of the imported API&lt;/a&gt; and select an operation you'd like to enhance. Open the "AI Copilot Samples" dialog, prompting Microcks to query the configured LLM via Docker Model Runner. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ciu5nwsjmb251mpva4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ciu5nwsjmb251mpva4p.png" alt="Click the button to trigger AI" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You may notice increased GPU activity as the model processes your request.&lt;/p&gt;

&lt;p&gt;After processing, the AI-generated mock responses are displayed, ready to be reviewed or added directly to your mocked operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6odjluta0gurm1sye7uu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6odjluta0gurm1sye7uu.png" alt="Generated API responses" width="800" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can easily test the generated mocks with a simple &lt;code&gt;curl&lt;/code&gt; command. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; PATCH &lt;span class="s1"&gt;'http://localhost:8080/rest/API+Pastry+-+2.0/2.0.0/pastry/Chocolate+Cake'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'accept: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"status":"out_of_stock"}'&lt;/span&gt;

&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt; : &lt;span class="s2"&gt;"Chocolate Cake"&lt;/span&gt;,
  &lt;span class="s2"&gt;"description"&lt;/span&gt; : &lt;span class="s2"&gt;"Rich chocolate cake with vanilla frosting"&lt;/span&gt;,
  &lt;span class="s2"&gt;"size"&lt;/span&gt; : &lt;span class="s2"&gt;"L"&lt;/span&gt;,
  &lt;span class="s2"&gt;"price"&lt;/span&gt; : 12.99,
  &lt;span class="s2"&gt;"status"&lt;/span&gt; : &lt;span class="s2"&gt;"out_of_stock"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns a realistic, AI-generated response that enhances the quality and reliability of your test data.&lt;/p&gt;

&lt;p&gt;For better reproducibility, you can specify the Docker Model Runner dependency and the chosen model explicitly in your &lt;code&gt;compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ai_runner:
  provider:
    type: model
    options:
      model: ai/qwen3:8B-F16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this in place, simply starting the Compose setup also pulls the model and waits for it to become available, the same way it does for containers.&lt;/p&gt;
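&lt;p&gt;A service that should wait for the model can then declare a dependency on the provider service. Here is a minimal sketch; the &lt;code&gt;app&lt;/code&gt; service name and image are illustrative, not taken from the Microcks Compose files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  ai_runner:
    provider:
      type: model
      options:
        model: ai/qwen3:8B-F16
  app:
    image: alpine
    depends_on:
      - ai_runner
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;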

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Docker Model Runner is an excellent local resource for running LLMs and provides compatibility with OpenAI APIs, allowing for seamless integration into existing workflows. &lt;br&gt;
Microcks, for example, can use Docker Model Runner to generate sample responses for the APIs it mocks, giving you richer synthetic data for your integration testing. &lt;/p&gt;

&lt;p&gt;In this article we looked at what it takes to configure these two tools to work together. If you have local AI workflows or just run LLMs locally, please let me know: I'd love to explore more local AI integrations with Docker. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Easy private AI assistant with Goose and Docker Model Runner</title>
      <dc:creator>Oleg Šelajev</dc:creator>
      <pubDate>Wed, 21 May 2025 11:27:50 +0000</pubDate>
      <link>https://dev.to/olegshelajev/easy-private-ai-assistant-with-goose-and-docker-model-runner-4b41</link>
      <guid>https://dev.to/olegshelajev/easy-private-ai-assistant-with-goose-and-docker-model-runner-4b41</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Using Goose and Docker Model Runner&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://block.github.io/goose/" rel="noopener noreferrer"&gt;Goose&lt;/a&gt; is an innovative CLI assistant designed to automate development tasks using AI models. Docker Model Runner simplifies deploying AI models locally with Docker. Combining these technologies, you get a powerful local environment with advanced AI assistance, ideal for coding and automation.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Install Goose CLI on macOS&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Install Goose with the official install script, using the classic curl-pipe-to-shell one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://github.com/block/goose/releases/download/stable/download_cli.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Enable Docker Model Runner&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;First, ensure you have &lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; Desktop installed, then configure Docker Model Runner with your model of choice. Go to Settings -&amp;gt; Beta features and check the checkboxes for Docker Model Runner.&lt;/p&gt;

&lt;p&gt;As a security precaution, Docker Model Runner is not exposed to your host machine by default, but to simplify the setup we'll enable TCP support as well. The default port for that is 12434, so the base URL for the connection is &lt;code&gt;http://localhost:12434&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzua9gzkap8fpui1bdt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzua9gzkap8fpui1bdt7.png" alt="Docker Desktop configuration to enable Docker Model Runner " width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can pull models from Docker Hub (&lt;a href="http://hub.docker.com/u/ai" rel="noopener noreferrer"&gt;hub.docker.com/u/ai&lt;/a&gt;) and run them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker model pull ai/qwen3:30B-A3B-Q4_K_M
docker model run ai/qwen3:30B-A3B-Q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second command starts an interactive chat with the model.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Configure Goose for Docker Model Runner&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Edit your Goose config at &lt;code&gt;~/.config/goose/config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GOOSE_MODEL: ai/qwen3:30B-A3B-Q4_K_M
GOOSE_PROVIDER: openai
extensions:
  developer:
    display_name: null
    enabled: true
    name: developer
    timeout: null
    type: builtin
GOOSE_MODE: auto
GOOSE_CLI_MIN_PRIORITY: 0.8
OPENAI_API_KEY: irrelevant
OPENAI_BASE_PATH: /engines/llama.cpp/v1/chat/completions
OPENAI_HOST: http://localhost:12434
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; value is irrelevant: Docker Model Runner does not require authentication, since the model runs locally and privately on your machine.&lt;/p&gt;

&lt;p&gt;We provide the base path for the OpenAI-compatible API and select the model we pulled earlier with &lt;code&gt;GOOSE_MODEL: ai/qwen3:30B-A3B-Q4_K_M&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Testing It Out&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Try the Goose CLI by running &lt;code&gt;goose&lt;/code&gt; in the terminal. You can see that it automatically connects to the correct model, and when you ask for something, you'll also see GPU activity spike.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmome5n4cffbm4m3xjmr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmome5n4cffbm4m3xjmr8.png" alt="Goose CLI powered by Docker Model Runner" width="800" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that we also configured Goose with the Developer extension enabled. It allows Goose to run various commands on your behalf, making it a much more powerful assistant with access to your machine, rather than just a chat application.&lt;/p&gt;

&lt;p&gt;You can additionally give Goose custom hints to tweak its behaviour using the &lt;a href="https://block.github.io/goose/docs/guides/using-goosehints/" rel="noopener noreferrer"&gt;.goosehints&lt;/a&gt; file.&lt;/p&gt;

&lt;p&gt;Even better, you can script Goose to run tasks on your behalf with a simple one-liner:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;goose run -t "your instructions here"&lt;/code&gt; or &lt;code&gt;goose run -i instructions.md&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code&gt;instructions.md&lt;/code&gt; is a file describing what to do.&lt;/p&gt;
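&lt;p&gt;As an illustration, a hypothetical &lt;code&gt;instructions.md&lt;/code&gt; could be as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summarize the git log of the current repository for the last week.
Write the summary to weekly-report.md in the repository root.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;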

&lt;p&gt;On macOS you have access to crontab for scheduling recurring scripts, so you can automate Goose with Docker Model Runner to run repeatedly and act on your behalf. For example,&lt;br&gt;
&lt;code&gt;crontab -e&lt;/code&gt; will open the editor for the commands you want to schedule, and a line like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;5 8 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 1-5 goose run &lt;span class="nt"&gt;-i&lt;/span&gt; fetch_and_summarize_news.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will make Goose run at 8:05 am every weekday and follow the instructions in the &lt;code&gt;fetch_and_summarize_news.md&lt;/code&gt; file, for example to skim the internet and prioritize news based on what you like.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;All in all, integrating Goose with Docker Model Runner creates a simple but powerful setup for using local AI in your workflows.&lt;br&gt;
You can make it run custom instructions for you, or easily script it to perform repetitive actions intelligently. &lt;br&gt;
It is all powered by a local model running in Docker Model Runner, so you don't compromise on privacy either. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
