Table of Contents
1. The Unspoken Challenge of the Agentic Era
2. The Model Problem Solved: Docker Model Runner and GPU Acceleration
3. Orchestration: The Single YAML File That Simplifies the Stack (PoC)
3.1. What We’re Building: The Private Q&A Bot
3.2. Project Structure
3.3. Step-by-Step Code
3.4. Ingestion
3.5. The Final Proof: One Command
3.6. Testing
4. The Security Imperative: Sandboxes and Isolation
5. Conclusion: The New Standard for AI Stack Orchestration
Project repo - RAG-Docker-POC
1. The Unspoken Challenge of the Agentic Era
Two months ago, I wanted to build a code swarm - a multi-agent code reviewer. The first struggle was memory. If you’re a developer, you know the dream of building the next great autonomous AI agent is often interrupted by the nightmare of the toolchain. You don’t just need Python; you need a specific version of CUDA, a compatible PyTorch install, a dependency manager that behaves, and a local machine strong enough to even host a small Large Language Model (LLM).
This initial friction is where most AI projects stall. It’s what leads to the feeling that AI development is fundamentally different and far more complicated than traditional software engineering.
But what if you could eliminate that friction? What if getting a complex, multi-service AI application running on your laptop took less time than installing a single Python library?
Docker’s Mandate: Simplicity, Security, and Speed
The vision, as laid out by Docker, is simple: to make building and running AI applications as easy, secure, and shareable as any other kind of software.
It’s a mandate that is already paying off:
TheCUBE Research confirms that 52% of users cut their AI project setup time by more than half, while 87% accelerated their time-to-market by at least 26%.
Docker's impact by TheCUBE Research
TheCUBE Research Report on Docker’s impact: Report
Docker achieves this transformation by bringing standardization to the two most chaotic parts of any AI project: the model itself and the complex infrastructure required to support it.
The Proof: A Real-World RAG Pipeline
To prove this dramatic reduction in friction and time-to-market, we are going to build a Full Retrieval-Augmented Generation (RAG) Pipeline - the foundation of modern and private AI applications.
Manually setting up the three core components - the LLM, the Vector Database, and the Orchestration Service - can take hours or even days. With Docker’s new tooling, it takes a single command:
Bash
docker compose up
2. The Model Problem Solved: Docker Model Runner and GPU Acceleration
The first point of friction in any AI project is the model itself. Before you can write a single line of application code, you must conquer the three-headed hydra of model setup: Dependencies, CUDA, and GPU Access.
Docker Model Runner (or a standardized local server like Ollama) abstracts away this complexity. The model is treated as just another containerized service, simplifying the setup.
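If you use Docker Model Runner (available in recent Docker Desktop releases), the model lifecycle collapses to a couple of CLI calls. A minimal sketch - the model name is illustrative, pick any model from Docker Hub’s ai/ namespace:
Bash
# Pull a model from Docker Hub's ai/ namespace, then run a quick prompt against it
docker model pull ai/smollm2
docker model run ai/smollm2 "Summarize what a vector database does."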
The single biggest barrier is GPU accessibility. Docker Compose solves this with declarative simplicity, ensuring the container running our LLM has direct, optimized access to your hardware:
YAML
# Snippet demonstrating GPU access via Compose
services:
  ollama: # Our LLM Service
    image: ollama/ollama # Use the standardized image
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia # Crucial: requests the GPU driver
              count: all
              capabilities: [gpu]
This small configuration confirms that “hardware limits aren’t a blocker anymore.” The model, once running, exposes an OpenAI-compatible API, standardizing communication regardless of the model you choose.
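You can check this yourself once the stack from the next section is up: the same ollama container answers OpenAI-style chat requests. A minimal sketch, assuming qwen2:0.5b has already been pulled and the requests package is installed on the host:
Python
import requests

# Ollama serves an OpenAI-compatible endpoint under /v1 on its usual port
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "qwen2:0.5b",
        "messages": [{"role": "user", "content": "In one sentence, what is RAG?"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])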
3. Orchestration: The Single YAML File That Simplifies the Stack (PoC)
This section demonstrates the core thesis - that Docker Compose makes the complex RAG architecture plug-and-play, instantly proving the claim that you can “define the whole AI stack... in a single YAML file.”
3.1 What We’re Building: The Private Q&A Bot
We are building a private, local Q&A service using a Retrieval-Augmented Generation (RAG) architecture.
Components
- ollama - Generates text and embeddings (LLM/Embedding Model)
- qdrant - Stores text embeddings (Vector Database)
- rag-app - Orchestrates the RAG flow (Python API)
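Before diving into files, here is the data flow the three services implement (matching the main.py code later in this section):
Ingestion: client → rag-app → ollama (embed each chunk) → qdrant (store vectors)
Query: client → rag-app → ollama (embed question) → qdrant (retrieve top chunks) → ollama (generate answer)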
3.2 Project Structure
We will use two files and a folder in our project root. The folder holds three files:
rag-docker-poc/
├── docker-compose.yml       # The orchestrator
├── rag-app/
│   ├── Dockerfile           # Build instructions for our Python service
│   ├── requirements.txt     # Python dependencies
│   └── main.py              # The main application file
└── .env                     # Stores environment variables (e.g., model name)
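The .env file is optional in this PoC - the Compose file below hardcodes its values - but if you want to swap models without editing YAML, it could hold the model names and be referenced with ${...} substitution. An illustrative version that mirrors the defaults used later:
# .env (illustrative - matches the defaults in docker-compose.yml and main.py)
EMBEDDING_MODEL_NAME=all-minilm
GENERATOR_MODEL_NAME=qwen2:0.5b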
3.3 Step-by-Step Code
Step 1: The Python Dependencies (rag-app/requirements.txt)
We need libraries to talk to Ollama and Qdrant.
rag-app/requirements.txt
ollama # This allows us to talk to the LLM/Embedding service
qdrant-client # This allows us to talk to the Vector DB
fastapi
uvicorn
pydantic
pypdf # For document loading
langchain-text-splitters # For the splitter utility
Step 2: The Application Container (rag-app/Dockerfile)
We define a lightweight container for our orchestrator API.
rag-app/Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Use a simple entry command (e.g., to run a web server)
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
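To sanity-check the image on its own (outside Compose), a standalone build and run might look like this. The backend URLs are illustrative - host.docker.internal works on Docker Desktop, and in the Compose setup the service names ollama and qdrant resolve automatically instead:
Bash
# Build the rag-app image and run it, supplying the two backend URLs by hand
docker build -t rag-app ./rag-app
docker run --rm -p 8000:8000 \
  -e OLLAMA_HOST=http://host.docker.internal:11434 \
  -e QDRANT_HOST=http://host.docker.internal:6333 \
  rag-app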
Step 3: The Main Application File (main.py)
We define a simple FastAPI application that ingests documents into Qdrant and answers questions against them.
rag-app/main.py
import os
import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from qdrant_client import QdrantClient, models
from ollama import Client as OllamaClient
from langchain_text_splitters import RecursiveCharacterTextSplitter

# --- Configuration (pulled from environment) ---
OLLAMA_HOST = os.getenv("OLLAMA_HOST")
QDRANT_HOST = os.getenv("QDRANT_HOST")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL_NAME", "all-minilm")
GENERATOR_MODEL = os.getenv("GENERATOR_MODEL_NAME", "qwen2:0.5b")
COLLECTION_NAME = "private_documents"
VECTOR_SIZE = 384  # all-minilm produces 384-dimensional vectors

# --- Chunking parameters (critical for RAG quality) ---
CHUNK_SIZE = 500      # Number of characters per chunk
CHUNK_OVERLAP = 100   # Overlapping characters between chunks

# --- Initialization ---
app = FastAPI()
qdrant_client = QdrantClient(url=QDRANT_HOST)
ollama_client = OllamaClient(host=OLLAMA_HOST)


class Query(BaseModel):
    question: str


class Ingest(BaseModel):
    document_text: str


# --- RAG Functions ---
def create_collection_if_not_exists():
    """Ensures the Qdrant collection is ready for vectors."""
    if not qdrant_client.collection_exists(collection_name=COLLECTION_NAME):
        qdrant_client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=models.VectorParams(size=VECTOR_SIZE, distance=models.Distance.COSINE),
        )


def process_and_upsert_document(document_text: str):
    """Splits a document into chunks, embeds them, and uploads to Qdrant."""
    # 1. Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", " ", ""],
    )

    # 2. Split the text
    chunks = text_splitter.split_text(document_text)

    # 3. Embed each chunk and build the Qdrant point structures
    points = []
    for chunk in chunks:
        # Generate the embedding vector for the chunk
        response = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=chunk)
        vector = response["embedding"]
        points.append(
            models.PointStruct(
                id=str(uuid.uuid4()),  # Use a UUID string as a unique ID
                vector=vector,
                payload={"text": chunk, "source_doc": "user_upload"},
            )
        )

    # 4. Upsert (upload) all points to Qdrant
    qdrant_client.upsert(
        collection_name=COLLECTION_NAME,
        points=points,
        wait=True,
    )
    return len(chunks)


@app.on_event("startup")
def startup_event():
    """Runs on container startup."""
    create_collection_if_not_exists()


@app.post("/ingest")
def ingest_document(data: Ingest):
    """Ingests a document (text) into the vector database by splitting it."""
    num_chunks = process_and_upsert_document(data.document_text)
    return {"message": f"Successfully indexed {num_chunks} chunks using {EMBEDDING_MODEL}"}


@app.post("/query")
def retrieve_and_generate(query: Query):
    """Retrieves context and asks the LLM to generate an answer."""
    # 1. Generate the query vector
    query_vector = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=query.question)["embedding"]

    # 2. Retrieve context from Qdrant (using the query_points API)
    search_result = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_vector,
        limit=2,
    ).points

    # 3. Construct the RAG prompt
    context = "\n".join([hit.payload["text"] for hit in search_result])
    prompt_template = f"""
    Context: {context}
    Question: {query.question}
    Use the provided context to answer the question briefly.
    """

    # 4. Generate the answer using the LLM
    response = ollama_client.generate(
        model=GENERATOR_MODEL,
        prompt=prompt_template,
    )
    return {
        "answer": response["response"],
        "context_used": context,
        "retrieval_score_1": search_result[0].score if search_result else None,
    }
Step 4: The Orchestration Blueprint (docker-compose.yml)
This single file links the three services and handles all configuration.
YAML
# docker-compose.yml
services:
  # 1. The LLM and Embedding Model Service (Ollama)
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # CRITICAL: Enable GPU access for performance (remove if no GPU is available)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    entrypoint:
      - /bin/bash
      - -c
      - |
        ollama serve &
        sleep 5
        ollama pull all-minilm
        ollama pull qwen2:0.5b
        wait
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 10
      start_period: 300s # 5 minutes for the initial model pulls

  # 2. The Vector Database Service (Qdrant)
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
    healthcheck:
      test: ["CMD-SHELL", "bash -c 'echo > /dev/tcp/localhost/6333'"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 5s

  # 3. The RAG Orchestrator API (rag-app)
  rag-app:
    build:
      context: ./rag-app
      dockerfile: Dockerfile
    container_name: rag-app
    ports:
      - "8000:8000" # The Dockerfile starts uvicorn on port 8000
    environment:
      OLLAMA_HOST: http://ollama:11434
      QDRANT_HOST: http://qdrant:6333
      GENERATOR_MODEL_NAME: qwen2:0.5b
      EMBEDDING_MODEL_NAME: all-minilm
    depends_on:
      qdrant:
        condition: service_healthy
      ollama:
        condition: service_healthy

volumes:
  qdrant_data:
  ollama_data:
3.4 Ingestion
To prove the RAG pipeline's effectiveness, we will use a real, dense technical document. For this demonstration, we'll use a portion of the Kubernetes documentation (a complex, multi-layered topic) as our private knowledge source - imagine we want to build a chat agent that answers any question about Kubernetes. We split the document into chunks, generate an embedding for each chunk with the embedding model, and store those vectors in Qdrant.
Sample Document Text (for Ingestion):
"A Pod is the smallest execution unit in Kubernetes. It represents a single instance of a running process in a cluster. Pods can contain one or more containers, which are guaranteed to be co-located on the same host and share resources. The resources shared include volumes and a unique cluster IP address. Controllers like Deployments manage Pods automatically, handling replication and self-healing. When a node fails, the Controller automatically creates an equivalent replacement Pod on a different healthy node."
3.5 The Final Proof: One Command
The complexity of three distinct services, specialized models, and GPU configuration is now hidden behind the unified Docker surface.
Command:
Bash
docker compose up -d
This single command replaces hours of manual installation. In moments, your entire RAG pipeline - LLM, Vector DB, and application - is running securely on your local machine, fully isolated. This is exactly the kind of setup-time saving TheCUBE Research reports.
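Before testing, it's worth confirming that all three containers are healthy - the first start can take a few minutes while ollama pulls the two models:
Bash
docker compose ps              # All three services should show as running/healthy
docker compose logs -f ollama  # Watch the initial model pulls complete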
3.6 Testing
Step 1: Ingest the Private Document
Send the sample text to our API's /ingest endpoint.
curl -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{"document_text": "A Pod is the smallest execution unit in Kubernetes. It represents a single instance of a running process in a cluster. Pods can contain one or more containers, which are guaranteed to be co-located on the same host and share resources. The resources shared include volumes and a unique cluster IP address. Controllers like Deployments manage Pods automatically, handling replication and self-healing. When a node fails, the Controller automatically creates an equivalent replacement Pod on a different healthy node."}'
Output: a JSON confirmation from /ingest reporting how many chunks were indexed and which embedding model was used.
Step 2: Query the System
Ask a question that requires knowledge only found in the document.
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What happens to a Pod when its node fails?"}'
Output: a JSON response containing the generated answer, the retrieved context, and the similarity score of the top match.
4. The Security Imperative: Sandboxes and Isolation
The Docker vision is equally focused on security, which becomes paramount with highly capable, autonomous agents.
Agents need to run shell commands (npm install, git commit) to be useful. When running directly on your host machine, this capability is a liability, as a Prompt Injection attack can trick the agent into running malicious code (e.g., reading your ~/.ssh/keys or running destructive commands).
Docker addresses this by providing a structural solution: OS-level isolation. When our rag-app runs in its container, the agent can only see the files mounted into the container - our project directory. Any attempt to access sensitive files on the host system fails immediately with a “No such file or directory” error. This moves security from fragile application guardrails to a robust, structural defence that is always on.
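Compose lets you push this isolation further with a few optional hardening keys. A minimal sketch for the rag-app service - not part of the PoC compose file above, and you may need a writable tmpfs for temporary files:
YAML
  rag-app:
    # ...build, ports, environment and depends_on as before...
    read_only: true              # Make the container filesystem read-only
    tmpfs:
      - /tmp                     # Keep a writable scratch area
    cap_drop:
      - ALL                      # Drop all Linux capabilities
    security_opt:
      - no-new-privileges:true   # Prevent privilege escalation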
5. Conclusion: The New Standard for AI Stack Orchestration
The fragmented toolchain that once defined AI development is now obsolete. By containerizing the entire workflow—from GPU-accelerated LLMs to complex multi-service architectures—Docker has successfully shifted the focus from infrastructure friction to pure innovation.
The statistics from TheCUBE Research are not hyperbole; they are the logical outcome of a standardized platform:
- Fast setup: 52% of users cut their setup time by more than half - demonstrated here by deploying a complete RAG pipeline with a single docker compose up command.
- Secure: Confirmed by enforcing OS-level isolation, neutralizing the risks posed by highly autonomous agents.
- Shareable: Demonstrated by packaging the entire stack (models, database, app) into a single, portable YAML file.
Docker's vision is clear, and as we've demonstrated, it's working. The future of AI development belongs to the standard, secure, and zero-friction containerized workflow.
Next Steps (Optional)
To dig deeper, explore the full project in the RAG-Docker-POC repo and TheCUBE Research report linked above.