Table of Contents
1. The Unspoken Challenge of the Agentic Era
2. The Model Problem Solved: Docker Model Runner and GPU Acceleration
3. Orchestration: The Single YAML File That Simplifies the Stack (PoC)
3.1. What We’re Building: The Private Q&A Bot
3.2. Project Structure
3.3. Step-by-Step Code
3.4. Ingestion
3.5. The Final Proof: One Command
3.6. Testing
4. The Security Imperative: Sandboxes and Isolation
5. Conclusion: The New Standard for AI Stack Orchestration
Project repo - RAG-Docker-POC
1. The Unspoken Challenge of the Agentic Era
Two months ago, I wanted to build a code swarm - a multi-agent code reviewer. The first struggle was memory. If you’re a developer, you know the dream of building the next great autonomous AI agent is often interrupted by the nightmare of the toolchain. You don’t just need Python; you need a specific version of CUDA, a compatible PyTorch install, a dependency manager that behaves, and a local machine strong enough to even host a small Large Language Model (LLM).
This initial friction is where most AI projects stall. It’s what leads to the feeling that AI development is fundamentally different and far more complicated than traditional software engineering.
But what if you could eliminate that friction? What if getting a complex, multi-service AI application running on your laptop took less time than installing a single Python library?
Docker’s Mandate: Simplicity, Security, and Speed
The vision, as laid out by Docker, is simple: to make building and running AI applications as easy, secure, and shareable as any other kind of software.
It’s a mandate that is already paying off:
TheCUBE Research confirms that 52% of users cut their AI project setup time by more than half, while 87% accelerated their time-to-market by at least 26%.
Docker's impact by TheCUBE Research
TheCUBE Research Report on Docker’s impact: Report
Docker achieves this transformation by bringing standardization to the two most chaotic parts of any AI project: the model itself and the complex infrastructure required to support it.
The Proof: A Real-World RAG Pipeline
To prove this dramatic reduction in friction and time-to-market, we are going to build a Full Retrieval-Augmented Generation (RAG) Pipeline - the foundation of modern and private AI applications.
Manually setting up the three core components - the LLM, the Vector Database, and the Orchestration Service - can take hours or even days. With Docker’s new tooling, it takes a single command:
Bash
docker compose up
2. The Model Problem Solved: Docker Model Runner and GPU Acceleration
The first point of friction in any AI project is the model itself. Before you can write a single line of application code, you must conquer the three-headed hydra of model setup: Dependencies, CUDA, and GPU Access.
Docker Model Runner (or a standardized local server like Ollama) abstracts away this complexity. The model is treated as just another containerized service, simplifying the setup.
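If you use Docker Model Runner (available in recent Docker Desktop releases), the model lifecycle collapses to a couple of CLI calls. A minimal sketch - the model name is illustrative, pick any model from Docker Hub’s ai/ namespace:
Bash
# Pull a model from Docker Hub's ai/ namespace, then run a quick prompt against it
docker model pull ai/smollm2
docker model run ai/smollm2 "Summarize what a vector database does."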
The single biggest barrier is GPU accessibility. Docker Compose solves this with declarative simplicity, ensuring the container running our LLM has direct, optimized access to your hardware:
YAML
# Snippet demonstrating GPU access via Compose
services:
  ollama: # Our LLM Service
    image: ollama/ollama # Use the standardized image
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia # Crucial: requests the GPU driver
              count: all
              capabilities: [gpu]
This small configuration confirms that “hardware limits aren’t a blocker anymore.” The model, once running, exposes an OpenAI-compatible API, standardizing communication regardless of the model you choose.
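You can check this yourself once the stack from the next section is up: the same ollama container answers OpenAI-style chat requests. A minimal sketch, assuming qwen2:0.5b has already been pulled and the requests package is installed on the host:
Python
import requests

# Ollama serves an OpenAI-compatible endpoint under /v1 on its usual port
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "qwen2:0.5b",
        "messages": [{"role": "user", "content": "In one sentence, what is RAG?"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])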
3. Orchestration: The Single YAML File That Simplifies the Stack (PoC)
This section demonstrates the core thesis - that Docker Compose makes the complex RAG architecture plug-and-play, instantly proving the claim that you can “define the whole AI stack... in a single YAML file.”
3.1 What We’re Building: The Private Q&A Bot
We are building a private, local Q&A service using a Retrieval-Augmented Generation (RAG) architecture.
Components
- ollama - Generates text and embeddings (LLM/Embedding Model)
- qdrant - Stores text embeddings (Vector Database)
- rag-app - Orchestrates the RAG flow (Python API)
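Before diving into files, here is the data flow the three services implement (matching the main.py code later in this section):
Ingestion: client → rag-app → ollama (embed each chunk) → qdrant (store vectors)
Query: client → rag-app → ollama (embed question) → qdrant (retrieve top chunks) → ollama (generate answer)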
3.2 Project Structure
We will use two files and a folder in our project root. The folder holds three files:
rag-docker-poc/
├── docker-compose.yml       # The orchestrator
├── rag-app/
│   ├── Dockerfile           # Build instructions for our Python service
│   ├── requirements.txt     # Python dependencies
│   └── main.py              # The main application file
└── .env                     # Stores environment variables (e.g., model name)
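The .env file is optional in this PoC - the Compose file below hardcodes its values - but if you want to swap models without editing YAML, it could hold the model names and be referenced with ${...} substitution. An illustrative version that mirrors the defaults used later:
# .env (illustrative - matches the defaults in docker-compose.yml and main.py)
EMBEDDING_MODEL_NAME=all-minilm
GENERATOR_MODEL_NAME=qwen2:0.5b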
3.3 Step-by-Step Code
Step 1: The Python Dependencies (rag-app/requirements.txt)
We need libraries to talk to Ollama and Qdrant.
rag-app/requirements.txt
ollama # This allows us to talk to the LLM/Embedding service
qdrant-client # This allows us to talk to the Vector DB
fastapi
uvicorn
pydantic
pypdf # For document loading
langchain-text-splitters # For the splitter utility
Step 2: The Application Container (rag-app/Dockerfile)
We define a lightweight container for our orchestrator API.
rag-app/Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Use a simple entry command (e.g., to run a web server)
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
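To sanity-check the image on its own (outside Compose), a standalone build and run might look like this. The backend URLs are illustrative - host.docker.internal works on Docker Desktop, and in the Compose setup the service names ollama and qdrant resolve automatically instead:
Bash
# Build the rag-app image and run it, supplying the two backend URLs by hand
docker build -t rag-app ./rag-app
docker run --rm -p 8000:8000 \
  -e OLLAMA_HOST=http://host.docker.internal:11434 \
  -e QDRANT_HOST=http://host.docker.internal:6333 \
  rag-app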
Step 3: The Main Application File (main.py)
We define a simple FastAPI application that ingests documents into Qdrant and answers questions against them.
rag-app/main.py
import os
import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from qdrant_client import QdrantClient, models
from ollama import Client as OllamaClient
from langchain_text_splitters import RecursiveCharacterTextSplitter

# --- Configuration (pulled from environment) ---
OLLAMA_HOST = os.getenv("OLLAMA_HOST")
QDRANT_HOST = os.getenv("QDRANT_HOST")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL_NAME", "all-minilm")
GENERATOR_MODEL = os.getenv("GENERATOR_MODEL_NAME", "qwen2:0.5b")
COLLECTION_NAME = "private_documents"
VECTOR_SIZE = 384  # all-minilm produces 384-dimensional vectors

# --- Chunking parameters (critical for RAG quality) ---
CHUNK_SIZE = 500      # Number of characters per chunk
CHUNK_OVERLAP = 100   # Overlapping characters between chunks

# --- Initialization ---
app = FastAPI()
qdrant_client = QdrantClient(url=QDRANT_HOST)
ollama_client = OllamaClient(host=OLLAMA_HOST)


class Query(BaseModel):
    question: str


class Ingest(BaseModel):
    document_text: str


# --- RAG Functions ---
def create_collection_if_not_exists():
    """Ensures the Qdrant collection is ready for vectors."""
    if not qdrant_client.collection_exists(collection_name=COLLECTION_NAME):
        qdrant_client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=models.VectorParams(size=VECTOR_SIZE, distance=models.Distance.COSINE),
        )


def process_and_upsert_document(document_text: str):
    """Splits a document into chunks, embeds them, and uploads to Qdrant."""
    # 1. Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", " ", ""],
    )

    # 2. Split the text
    chunks = text_splitter.split_text(document_text)

    # 3. Embed each chunk and build the Qdrant point structures
    points = []
    for chunk in chunks:
        # Generate the embedding vector for the chunk
        response = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=chunk)
        vector = response["embedding"]
        points.append(
            models.PointStruct(
                id=str(uuid.uuid4()),  # Use a UUID string as a unique ID
                vector=vector,
                payload={"text": chunk, "source_doc": "user_upload"},
            )
        )

    # 4. Upsert (upload) all points to Qdrant
    qdrant_client.upsert(
        collection_name=COLLECTION_NAME,
        points=points,
        wait=True,
    )
    return len(chunks)


@app.on_event("startup")
def startup_event():
    """Runs on container startup."""
    create_collection_if_not_exists()


@app.post("/ingest")
def ingest_document(data: Ingest):
    """Ingests a document (text) into the vector database by splitting it."""
    num_chunks = process_and_upsert_document(data.document_text)
    return {"message": f"Successfully indexed {num_chunks} chunks using {EMBEDDING_MODEL}"}


@app.post("/query")
def retrieve_and_generate(query: Query):
    """Retrieves context and asks the LLM to generate an answer."""
    # 1. Generate the query vector
    query_vector = ollama_client.embeddings(model=EMBEDDING_MODEL, prompt=query.question)["embedding"]

    # 2. Retrieve context from Qdrant (using the query_points API)
    search_result = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_vector,
        limit=2,
    ).points

    # 3. Construct the RAG prompt
    context = "\n".join([hit.payload["text"] for hit in search_result])
    prompt_template = f"""
    Context: {context}
    Question: {query.question}
    Use the provided context to answer the question briefly.
    """

    # 4. Generate the answer using the LLM
    response = ollama_client.generate(
        model=GENERATOR_MODEL,
        prompt=prompt_template,
    )
    return {
        "answer": response["response"],
        "context_used": context,
        "retrieval_score_1": search_result[0].score if search_result else None,
    }
Step 4: The Orchestration Blueprint (docker-compose.yml)
This single file links the three services and handles all configuration.
YAML
# docker-compose.yml
services:
  # 1. The LLM and Embedding Model Service (Ollama)
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # CRITICAL: Enable GPU access for performance (remove if no GPU is available)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    entrypoint:
      - /bin/bash
      - -c
      - |
        ollama serve &
        sleep 5
        ollama pull all-minilm
        ollama pull qwen2:0.5b
        wait
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 10
      start_period: 300s # 5 minutes for the initial model pulls

  # 2. The Vector Database Service (Qdrant)
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
    healthcheck:
      test: ["CMD-SHELL", "bash -c 'echo > /dev/tcp/localhost/6333'"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 5s

  # 3. The RAG Orchestrator API (rag-app)
  rag-app:
    build:
      context: ./rag-app
      dockerfile: Dockerfile
    container_name: rag-app
    ports:
      - "8000:8000" # The Dockerfile starts uvicorn on port 8000
    environment:
      OLLAMA_HOST: http://ollama:11434
      QDRANT_HOST: http://qdrant:6333
      GENERATOR_MODEL_NAME: qwen2:0.5b
      EMBEDDING_MODEL_NAME: all-minilm
    depends_on:
      qdrant:
        condition: service_healthy
      ollama:
        condition: service_healthy

volumes:
  qdrant_data:
  ollama_data:
3.4 Ingestion
To prove the RAG pipeline's effectiveness, we will use a real, dense technical document. For this demonstration, we'll use a portion of the Kubernetes documentation (a complex, multi-layered topic) as our private knowledge source - imagine we want to build a chat agent that answers any question about Kubernetes. We split the document into chunks, generate an embedding for each chunk with the embedding model, and store those vectors in Qdrant.
Sample Document Text (for Ingestion):
"A Pod is the smallest execution unit in Kubernetes. It represents a single instance of a running process in a cluster. Pods can contain one or more containers, which are guaranteed to be co-located on the same host and share resources. The resources shared include volumes and a unique cluster IP address. Controllers like Deployments manage Pods automatically, handling replication and self-healing. When a node fails, the Controller automatically creates an equivalent replacement Pod on a different healthy node."
3.5 The Final Proof: One Command
The complexity of three distinct services, specialized models, and GPU configuration is now hidden behind the unified Docker surface.
Command:
Bash
docker compose up -d
This single command replaces hours of manual installation. In moments, your entire RAG pipeline - LLM, Vector DB, and application - is running securely on your local machine, fully isolated. This is exactly the kind of setup-time saving TheCUBE Research reports.
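Before testing, it's worth confirming that all three containers are healthy - the first start can take a few minutes while ollama pulls the two models:
Bash
docker compose ps              # All three services should show as running/healthy
docker compose logs -f ollama  # Watch the initial model pulls complete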
3.6 Testing
Step 1: Ingest the Private Document
Send the sample text to our API's /ingest endpoint.
curl -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{"document_text": "A Pod is the smallest execution unit in Kubernetes. It represents a single instance of a running process in a cluster. Pods can contain one or more containers, which are guaranteed to be co-located on the same host and share resources. The resources shared include volumes and a unique cluster IP address. Controllers like Deployments manage Pods automatically, handling replication and self-healing. When a node fails, the Controller automatically creates an equivalent replacement Pod on a different healthy node."}'
Output: a JSON confirmation from /ingest reporting how many chunks were indexed and which embedding model was used.
Step 2: Query the System
Ask a question that requires knowledge only found in the document.
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What happens to a Pod when its node fails?"}'
Output: a JSON response containing the generated answer, the retrieved context, and the similarity score of the top match.
4. The Security Imperative: Sandboxes and Isolation
The Docker vision is equally focused on security, which becomes paramount with highly capable, autonomous agents.
Agents need to run shell commands (npm install, git commit) to be useful. When running directly on your host machine, this capability is a liability, as a Prompt Injection attack can trick the agent into running malicious code (e.g., reading your ~/.ssh/keys or running destructive commands).
Docker addresses this by providing a structural solution: OS-level isolation. When our rag-app runs in its container, the agent can only see the files mounted into the container - our project directory. Any attempt to access sensitive files on the host system fails immediately with a “No such file or directory” error. This moves security from fragile application guardrails to a robust, structural defence that is always on.
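Compose lets you push this isolation further with a few optional hardening keys. A minimal sketch for the rag-app service - not part of the PoC compose file above, and you may need a writable tmpfs for temporary files:
YAML
  rag-app:
    # ...build, ports, environment and depends_on as before...
    read_only: true              # Make the container filesystem read-only
    tmpfs:
      - /tmp                     # Keep a writable scratch area
    cap_drop:
      - ALL                      # Drop all Linux capabilities
    security_opt:
      - no-new-privileges:true   # Prevent privilege escalation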
5. Conclusion: The New Standard for AI Stack Orchestration
The fragmented toolchain that once defined AI development is now obsolete. By containerizing the entire workflow—from GPU-accelerated LLMs to complex multi-service architectures—Docker has successfully shifted the focus from infrastructure friction to pure innovation.
The statistics from TheCUBE Research are not hyperbole; they are the logical outcome of a standardized platform:
- Fast setup: 52% of users cut their setup time by more than half - demonstrated here by deploying a complete RAG pipeline with a single docker compose up command.
- Secure: Confirmed by enforcing OS-level isolation, neutralizing the risks posed by highly autonomous agents.
- Shareable: Demonstrated by packaging the entire stack (models, database, app) into a single, portable YAML file.
Docker's vision is clear, and as we've demonstrated, it's working. The future of AI development belongs to the standard, secure, and zero-friction containerized workflow.
Next Steps (Optional)
To dig deeper, explore the full project in the RAG-Docker-POC repo and TheCUBE Research report linked above.