Raju Dandigam

Stop Burning API Credits While Building AI Apps: Run Local LLMs with Docker Model Runner

Building AI features usually starts with a cloud API. That is the fastest path when you are experimenting with chat interfaces, summarization, classification, content generation, or agent workflows. You add an SDK, pass an API key, send a prompt, and get a response back.

That simplicity is great, but during active development it can also get expensive. Every prompt experiment, failed test, retry, debugging session, and local demo sends another request to a paid service. For one developer, the cost may be small. For a team building AI features every day, those calls add up quickly. There is also another concern: not every development prompt should leave your machine, especially when you are testing with internal documents, customer-like data, logs, or proprietary examples.

Docker Model Runner gives JavaScript developers another option. It lets you run AI models locally using Docker’s workflow and expose them through APIs that feel familiar to developers already using OpenAI-style clients. Docker describes Model Runner as a way to run and manage AI models locally, serve models through OpenAI- and Ollama-compatible APIs, and package model files as OCI artifacts. That means AI models can start behaving more like other Docker-managed development dependencies.

This does not mean local models replace cloud models for every use case. They usually do not. Cloud models are still better for production workloads that need high-quality reasoning, scale, reliability, and the latest model capabilities. The more useful point is simpler: local models are very useful during development, especially when you want fast iteration, predictable cost, and better control over data.

Here is the workflow in one view: the application code stays almost the same, and the main difference is configuration. In development, your OpenAI-compatible client points to Docker Model Runner. In production, it points to your cloud provider.

Docker Model Runner is integrated with Docker Desktop and Docker Engine. Docker’s API reference shows that host processes can access the Model Runner API at http://localhost:12434, while containers can access it through Docker networking patterns such as model-runner.docker.internal:12434 when configured through Compose.

Before writing code, enable Docker Model Runner in Docker Desktop if it is not already enabled. Then confirm the CLI is available.

docker model --help

You can pull a model with the docker model pull command. The exact model you choose depends on what is available in your Docker environment and what your machine can run comfortably.

docker model pull ai/llama3.2:3B-Q4_K_M

After pulling a model, you can run a quick prompt from the command line.

docker model run ai/llama3.2:3B-Q4_K_M "Explain Docker containers in one sentence."

This is already useful for quick experiments, but the real value for JavaScript developers comes from calling the local model from a Node.js app.
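
Before wiring up the SDK, you can confirm the endpoint is reachable from Node with a plain fetch call. This is a quick sketch, meant to run as an ES module on Node 18 or newer, and it assumes the OpenAI-compatible route exposes the standard GET /models listing.

// check-model-runner.mjs (hypothetical file name)
// Ask the local Model Runner endpoint which models it currently serves.
const response = await fetch("http://localhost:12434/engines/llama.cpp/v1/models");

if (!response.ok) {
  throw new Error(`Model Runner endpoint returned HTTP ${response.status}`);
}

// OpenAI-style APIs return a list object with one entry per available model.
console.log(await response.json());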

Install the OpenAI SDK.

npm install openai

Now create a small TypeScript helper, for example generate-summary.ts, that talks to the local Docker Model Runner endpoint.

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY || "local-development-key",
  baseURL: process.env.OPENAI_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1"
});

export async function generateSummary(text: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "ai/llama3.2:3B-Q4_K_M",
    messages: [
      {
        role: "system",
        content: "You summarize technical text clearly and briefly."
      },
      {
        role: "user",
        content: `Summarize this text in three sentences:\n\n${text}`
      }
    ],
    temperature: 0.3
  });

  return response.choices[0]?.message?.content ?? "";
}

Then call it from a simple script.

import { generateSummary } from "./generate-summary";

async function main() {
  const summary = await generateSummary(`
    Docker Model Runner lets developers run AI models locally and call them
    through familiar API formats. This can reduce development cost and keep
    sensitive experimentation data on the developer machine.
  `);

  console.log(summary);
}

main().catch((error) => {
  console.error("Failed to generate summary:", error);
  process.exit(1);
});

The most important part is not the example itself. The important part is the boundary. Your application is not tightly coupled to one provider. It is coupled to an OpenAI-compatible interface. That gives you flexibility.

In local development, you can use this environment configuration.

OPENAI_BASE_URL=http://localhost:12434/engines/llama.cpp/v1
OPENAI_API_KEY=local-development-key

The rest of your application does not need to change. This pattern is valuable because most AI application code should not care whether the model is running locally or remotely. It should care about the contract: send messages, receive a response, handle errors, and validate the output.
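
One way to keep that contract in a single place is a small wrapper that owns the client, the model name, and the error handling. This is a minimal sketch; CHAT_MODEL is a hypothetical variable added so local and cloud environments can pick different models without code changes.

import OpenAI from "openai";

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// CHAT_MODEL is a hypothetical variable; each environment can pick its own model.
const MODEL = process.env.CHAT_MODEL || "ai/llama3.2:3B-Q4_K_M";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY || "local-development-key",
  baseURL: process.env.OPENAI_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1"
});

// The contract the rest of the app depends on: send messages, get text back, fail clearly.
export async function chat(messages: ChatMessage[]): Promise<string> {
  try {
    const response = await client.chat.completions.create({
      model: MODEL,
      messages
    });
    return response.choices[0]?.message?.content ?? "";
  } catch (error) {
    // A local endpoint fails differently from a cloud one (for example, the model
    // was never pulled), so translate the failure into one clear error.
    throw new Error(`Chat completion failed: ${(error as Error).message}`);
  }
}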

A practical use case for local models is development-time text processing. For example, imagine you are building an internal support tool that summarizes customer tickets before a human reads them. During development, you may run the same prompt hundreds of times while tuning the wording, testing edge cases, and adjusting the UI. A local model is a good fit for that stage because you are optimizing the workflow, not making final production-quality decisions.

Here is a slightly more realistic example.

// "client" is the OpenAI client configured earlier to point at Docker Model Runner.

type TicketSummary = {
  category: "billing" | "bug" | "account" | "other";
  summary: string;
};

export async function summarizeTicket(ticketText: string): Promise<TicketSummary> {
  const response = await client.chat.completions.create({
    model: "ai/llama3.2:3B-Q4_K_M",
    messages: [
      {
        role: "system",
        content:
          "Classify the support ticket and summarize it. Return only valid JSON."
      },
      {
        role: "user",
        content: ticketText
      }
    ],
    temperature: 0.2
  });

  const content = response.choices[0]?.message?.content ?? "{}";

  try {
    return JSON.parse(content) as TicketSummary;
  } catch {
    return {
      category: "other",
      summary: "The model returned an invalid response."
    };
  }
}

This example is intentionally simple. In a real application, you would validate the response with a schema library such as Zod, add retries for invalid JSON, and log model behavior for debugging. The point is that Docker Model Runner lets you build and test this workflow locally without sending every prompt to a cloud API.
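
As an illustration, here is a hedged sketch of that validation layer using Zod, assuming you add zod as a dependency. The schema mirrors the TicketSummary shape above, and the parser returns null instead of guessing when the model output does not match.

import { z } from "zod";

// Mirrors the TicketSummary type used above.
const TicketSummarySchema = z.object({
  category: z.enum(["billing", "bug", "account", "other"]),
  summary: z.string().min(1)
});

export type ValidatedTicketSummary = z.infer<typeof TicketSummarySchema>;

// Returns null when the output is not valid JSON or does not match the schema,
// so the caller can decide whether to retry the prompt or fall back to a default.
export function parseTicketSummary(raw: string): ValidatedTicketSummary | null {
  try {
    return TicketSummarySchema.parse(JSON.parse(raw));
  } catch {
    return null;
  }
}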

Docker is also moving toward making models fit naturally into Compose-based development. The Docker Compose model reference describes a models section where an AI model can be defined as an OCI artifact, pulled and served by Model Runner, and then exposed to an application through injected connection information.

Conceptually, that means a future local AI development stack can look like this.

services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      OPENAI_BASE_URL: http://model-runner.docker.internal:12434/engines/llama.cpp/v1
      OPENAI_API_KEY: local-development-key
    extra_hosts:
      - "model-runner.docker.internal:host-gateway"

This keeps the Node.js application containerized while still allowing it to reach the local Model Runner endpoint. Docker’s API docs specifically note that containers may need an extra_hosts entry to access model-runner.docker.internal through the host gateway.
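
A small startup check can make a missing extra_hosts entry obvious instead of surfacing later as a vague connection error. This sketch reuses the OpenAI-compatible GET /models route assumed earlier and works the same on the host or inside a container.

// Fail fast if the application cannot reach the Model Runner endpoint.
const baseURL =
  process.env.OPENAI_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1";

export async function assertModelRunnerReachable(): Promise<void> {
  try {
    const response = await fetch(`${baseURL}/models`);
    if (!response.ok) {
      throw new Error(`unexpected status ${response.status}`);
    }
  } catch (error) {
    throw new Error(
      `Cannot reach Model Runner at ${baseURL}. If this code runs in a container, ` +
        `check the extra_hosts entry for model-runner.docker.internal. ` +
        `(${(error as Error).message})`
    );
  }
}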

There are several places where this local-first setup is useful.

It is useful for prompt iteration because you can test many versions without worrying about API usage. It is useful for privacy-sensitive development because test data can stay on your machine. It is useful for offline work after the model is already pulled. It is also useful for CI experiments where you want to run basic LLM-dependent tests without calling a cloud provider, although you should keep those tests small because local inference can be slower and hardware-dependent.
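
For the CI case, a minimal smoke test with Node's built-in test runner might look like this. It reuses the generateSummary helper from earlier and only checks that the local endpoint answers with non-empty text, not the quality of the output.

import { test } from "node:test";
import assert from "node:assert/strict";

import { generateSummary } from "./generate-summary";

// A deliberately small smoke test: the local model must respond, nothing more.
test("local model returns a non-empty summary", async () => {
  const summary = await generateSummary(
    "Docker Model Runner serves AI models locally through an OpenAI-compatible API."
  );
  assert.ok(summary.trim().length > 0);
});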

There are also clear limits.

Local models usually do not match the quality of the strongest hosted models. Smaller models can summarize, classify, rewrite, and answer simple questions reasonably well, but they may struggle with complex reasoning or long context tasks. Performance depends heavily on your hardware, especially RAM and GPU availability. A small model may run comfortably on a developer laptop, while a larger model may feel too slow for daily use.

Docker Model Runner is also best understood as a development tool first. Docker’s product page emphasizes local-first inference, no recurring API costs for local usage, privacy, and control. Those are development strengths. They do not automatically make it the right choice for high-scale production serving.

A healthy architecture is to keep both paths available.

Use local inference when you are designing prompts, building UI flows, testing basic behavior, working with sensitive examples, or experimenting with agent workflows. Use cloud inference when you need production reliability, stronger model quality, scale, monitoring, and service-level guarantees.

The bigger lesson is that AI development is starting to look more like normal software development. We want local dependencies. We want repeatable environments. We want clear configuration. We want the ability to run important parts of the system without depending on external services for every test.

Docker Model Runner fits into that shift. It brings AI models closer to the Docker workflow many developers already understand. You pull a model, run it locally, expose an API, and connect your application to it. For JavaScript and TypeScript developers, the OpenAI-compatible API makes the adoption path even easier because the application code can remain familiar.

This is not a replacement for cloud AI platforms. It is a practical addition to the developer toolbox. If you are building AI features in Node.js and you want cheaper prompt iteration, better local privacy, and a Docker-native workflow, Docker Model Runner is worth exploring.
