You've seen the demos. You want to run AutoBot on your own hardware, your own data, under your own control. Good instinct. Here's the full operational picture — Docker Compose internals, how to match LLM models to your GPU or CPU, and the production habits that keep things stable long-term.
Why Self-Host?
AutoBot's tagline is "Your data. Your AI." That's not marketing copy — it's an architectural choice. When you self-host:
- Conversations never leave your network
- You choose which models run (open-weight, cloud API, or a mix)
- Upgrade timing is yours to control
- No per-seat pricing surprises
The trade-off is operational responsibility. This post is about making that trade-off comfortable.
Docker Compose Deep Dive
AutoBot ships with a docker-compose.yml that wires together several services. Let's walk through each layer.
Services Overview
services:
  backend:
    build: ./backend
    ports: ["8000:8000"]
    depends_on: [chromadb, redis]
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - CHROMA_HOST=chromadb
      - REDIS_URL=redis://redis:6379
  frontend:
    build: ./frontend
    ports: ["3000:3000"]
    depends_on: [backend]
  chromadb:
    image: chromadb/chroma:latest
    volumes:
      - chroma_data:/chroma/chroma
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  chroma_data:
  redis_data:
  ollama_models:
What Each Service Does
backend — FastAPI application. Handles chat sessions, RAG retrieval, fleet management. The OLLAMA_HOST env var points it at your local model server; swap this for an OpenAI-compatible URL to use a cloud LLM instead.
frontend — Next.js UI. Talks only to the backend on port 8000. Stateless — you can restart it without losing anything.
chromadb — Vector database for knowledge bases. Your embedded documents live here. The chroma_data volume is critical — back it up.
redis — Session state and task queues. With --appendonly yes, Redis persists to disk. Losing this volume means losing active session context (but not your knowledge bases).
ollama — Local LLM inference server. Holds downloaded model weights in ollama_models. Models are large (4–70 GB each); this volume is expensive to rebuild.
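If you're using the GPU reservation above, it's worth confirming the container can actually see the card before you start pulling models. A quick check, assuming the NVIDIA Container Toolkit is installed on the host:

# Should list your GPU(s) from inside the ollama container
docker compose exec ollama nvidia-smi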
Networking
All services communicate on a default Docker bridge network. The service names (chromadb, redis, ollama) resolve as hostnames inside the network — that's why the backend config uses http://ollama:11434 rather than localhost.
For a production deployment, consider an explicit network definition:
networks:
  autobot_net:
    driver: bridge

services:
  backend:
    networks: [autobot_net]
    # ... same for all services
This lets you add an Nginx reverse proxy or Traefik on the same network without exposing internal ports.
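As a rough sketch, a reverse proxy can join that network as one more service. The nginx.conf path and port choices below are illustrative, not part of AutoBot's shipped compose file:

  nginx:
    image: nginx:1.27-alpine
    networks: [autobot_net]
    ports:
      - "443:443"        # only the proxy is exposed on the host
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro

Inside nginx.conf you proxy_pass to http://frontend:3000 and http://backend:8000 by service name; the host-facing ports: mappings on those services can then be removed.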
Model Sizing to Hardware
This is where most self-hosting guides go wrong — they talk about VPS pricing instead of the actual constraint: inference throughput vs. memory bandwidth.
The Rule of Thumb
A model running entirely in VRAM is fast. A model that spills to RAM (or worse, disk) is slow. Plan your setup so your primary model fits in VRAM with room for the OS and other processes.
| Hardware | VRAM | Practical Model Ceiling |
|---|---|---|
| RTX 3060 | 12 GB | Llama 3 8B (Q4), Mistral 7B |
| RTX 3090 / 4090 | 24 GB | Llama 3 8B (full precision); Llama 3 70B only with aggressive sub-Q4 quantization or partial CPU offload |
| 2× A100 80 GB | 160 GB | Llama 3 70B (full), most open-weight frontier models |
| CPU only (32 GB RAM) | — | Llama 3 8B (Q4, slow) — workable for low-traffic RAG |
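Two quick ways to sanity-check the fit on your own box (the second assumes a reasonably recent Ollama release, which reports where a loaded model is running):

# Host-side: total vs. used VRAM
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
# Inside the container: shows whether the loaded model is on GPU, CPU, or split
docker compose exec ollama ollama ps

If ollama ps reports anything other than 100% GPU for your primary model, you're in spill territory and the table above is telling you to pick a smaller quantization.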
Local Ollama vs. Cloud LLM Trade-offs
AutoBot supports both. Here's how to think about the choice:
Local Ollama (default)
- Zero per-token cost
- Private by definition
- Latency depends on your hardware
- Best for: high-volume internal tools, sensitive data, experimentation
Cloud LLM (OpenAI, Anthropic, etc.)
- Pay per token
- Faster for large models you can't run locally
- Data leaves your network (check your provider's retention policy)
- Best for: production apps that need frontier model quality without buying GPUs
The OLLAMA_HOST env var makes switching simple. Point it at https://api.openai.com/v1 (with an OpenAI-compatible wrapper) to route through a cloud provider without touching application code.
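As a sketch, one way to wire that up is a compose override file so the main docker-compose.yml stays untouched; the API-key variable name here is illustrative, not something AutoBot defines:

# docker-compose.override.yml (merged automatically by docker compose)
services:
  backend:
    environment:
      - OLLAMA_HOST=https://api.openai.com/v1    # via an OpenAI-compatible wrapper
      - LLM_API_KEY=${LLM_API_KEY}               # illustrative variable name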
Practical Model Recommendations
For a RAG-heavy knowledge base workload (most AutoBot deployments): a quantized 8B model (Llama 3.1 8B Q4_K_M) hits the sweet spot — fast enough for real-time chat, accurate enough for document retrieval, fits comfortably on a single consumer GPU.
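If you go that route, pulling the model into the ollama container is one command; the exact tag below comes from the Ollama model library and may differ for newer releases, so check before copying:

# Pull a quantized Llama 3.1 8B build into the ollama_models volume
docker compose exec ollama ollama pull llama3.1:8b-instruct-q4_K_M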
For a multi-agent fleet workload: consider running a smaller model (3B–7B) per agent node and reserving a larger model for orchestration decisions. AutoBot's fleet manager is built to handle per-agent model config.
Production Tips
Backups
The three volumes that matter:
# ChromaDB — your knowledge bases
docker run --rm \
  -v autobot_chroma_data:/source \
  -v /backup:/backup \
  alpine tar czf /backup/chroma-$(date +%Y%m%d).tar.gz -C /source .
# Redis — session state
docker exec autobot-redis-1 redis-cli BGSAVE
docker cp autobot-redis-1:/data/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
# Ollama models — large, but painful to re-download
docker run --rm \
  -v autobot_ollama_models:/source \
  -v /backup:/backup \
  alpine tar czf /backup/ollama-$(date +%Y%m%d).tar.gz -C /source .
Run chroma and redis backups daily. Ollama models only change when you pull new ones — back up on change, not on schedule.
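A crontab sketch for that cadence, assuming you've wrapped the commands above in scripts (the paths are illustrative):

# Daily backups at 02:00, staggered so they don't compete for I/O
0 2 * * * /opt/autobot/backup-chroma.sh
15 2 * * * /opt/autobot/backup-redis.sh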
Upgrades
# Pull latest images
docker compose pull
# Recreate containers (zero-downtime if you add a load balancer)
docker compose up -d --no-deps --build backend frontend
# Full restart (brief downtime)
docker compose down && docker compose up -d
Pin image tags in production (chromadb/chroma:0.5.3 not latest) so upgrades are deliberate, not automatic.
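In compose terms that looks like the snippet below; the specific versions are examples only, so substitute whatever you've actually tested against:

services:
  chromadb:
    image: chromadb/chroma:0.5.3
  redis:
    image: redis:7.2-alpine
  ollama:
    image: ollama/ollama:0.3.12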
Monitoring
AutoBot's backend exposes a /health endpoint. Wire it into your monitoring stack:
# Simple cron healthcheck
*/5 * * * * curl -sf http://localhost:8000/health || notify-oncall
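You can also let Docker restart an unhealthy backend on its own with a compose healthcheck. This is a sketch; the intervals are arbitrary and it assumes curl is available inside the backend image:

  backend:
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3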
For metrics, the backend emits structured logs to stdout. Forward them to Loki, Datadog, or whatever you already use:
backend:
  logging:
    driver: "json-file"
    options:
      max-size: "50m"
      max-file: "5"
Watch for these signals:
- ChromaDB query latency > 2s — index fragmentation or an under-resourced container
- Redis memory approaching its limit — set maxmemory and a sensible eviction policy (allkeys-lru); see the snippet after this list
- Ollama inference time spiking — the model is being swapped to RAM; consider reducing context length or switching to a smaller quantization
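For the Redis item, that translates to two extra flags on the command already in the compose file; the 512 MB cap is an example, so size it to your host:

  redis:
    command: redis-server --appendonly yes --maxmemory 512mb --maxmemory-policy allkeys-lru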
What's Next
Self-hosting is the start, not the finish. Once you're running in production, the interesting work is building knowledge bases, connecting data sources, and wiring up agents for your specific workflows.
If you want to help make AutoBot better at the infrastructure layer, there are open issues tagged for DevOps contributors:
→ Good first issues — DevOps label on AutoBot-AI
If AutoBot is saving you money or time on your infra, consider supporting development.
Questions, corrections, or war stories from your own deployment — drop them in the comments.