The Bill That Broke the Architecture
In early 2026, a founder I know got his first real AWS + API bill after three months of building. The number was not catastrophic. It was worse than that: it was predictable. Every new user, every new query, every new document ingested into the knowledge base added a fixed marginal cost he could not engineer away. The architecture was correct. The economics were not.
This is the scenario most tutorials skip. They show you how to build the thing. They do not show you what happens when the thing works and the invoices start compounding. According to McKinsey's The State of AI in 2024 (source), organizations are increasingly adopting open-source AI frameworks and self-hosted components specifically to reduce costs and accelerate deployment of production applications. The shift is not ideological. It is financial.
What follows is a layer-by-layer breakdown of the open-source stack we use and recommend: what each component does, which tools fill each role, and where the approach genuinely breaks down.
The Stack, Layer by Layer
A production AI application has roughly six layers: the inference layer (the LLM itself), the orchestration layer (how you chain calls and manage state), the retrieval layer (RAG and vector storage), the data layer (where documents and records live), the interface layer (how users or systems interact), and the deployment layer (how it runs continuously). Proprietary stacks charge at every one of these. Open-source stacks charge at none of them, with tradeoffs we will get to.
Inference: Local LLMs via Ollama
Ollama is the fastest path to running Llama 3, Mistral, and Phi-3 locally. Install it, pull a model, and you have an OpenAI-compatible API endpoint on localhost:11434. No API key. No rate limits. No per-token billing. For most classification, summarization, and structured extraction tasks, a quantized 7B or 13B parameter version of Mistral or Llama 3 performs comparably to the hosted APIs that cost money per call.
The honest limitation: local inference requires hardware. A machine with 16GB of unified memory (an M2 MacBook Pro, for instance) runs 7B parameter variants comfortably. Anything larger needs more RAM or a dedicated GPU. If your team works on underpowered laptops, "free" inference still has a hardware cost. And for genuinely complex reasoning tasks, the gap between a quantized open-source variant and a frontier reasoning engine is real. Do not pretend otherwise.
Orchestration: n8n
n8n is the orchestration layer we reach for first. Self-hosted via Docker, it connects to local LLM endpoints, external APIs, databases, and webhooks without a per-execution fee. The visual workflow builder makes it fast to prototype; the underlying JSON is version-controllable and auditable. For teams building automation chains that need to call an LLM, write to a database, send a notification, and loop back, n8n handles all of it without a SaaS subscription. You can see the range of what this enables in our full blueprint catalog.
Where n8n's self-hosted version shows its limits: complex branching logic with dozens of nodes gets visually unwieldy. Error handling requires deliberate design. If your team has no one comfortable reading node-level JSON, the maintenance burden accumulates.
Retrieval: Qdrant or Weaviate
Self-hosted retrieval-augmented generation pipelines are now genuinely straightforward. Qdrant runs as a single Docker container and exposes a REST and gRPC API for vector similarity search. Weaviate offers a similar footprint with a slightly richer query language. Both support hybrid search (dense vectors plus keyword matching), which matters for business documents where exact terminology is as important as semantic meaning.
The pipeline looks like this: ingest documents, chunk them, embed each chunk using a local embedding model (nomic-embed-text via Ollama works well), store the vectors in Qdrant, and at query time retrieve the top-k chunks before passing them to the LLM. The entire chain runs on your own infrastructure. No third-party SaaS touches your documents.
The tradeoff is operational. You own the uptime. If the Qdrant container crashes at 2am, no vendor support team fixes it. You need monitoring, restart policies, and someone who knows how to read container logs.
Data Layer: PostgreSQL + MinIO
PostgreSQL handles structured records. MinIO handles object storage (PDFs, audio files, raw exports) with an S3-compatible API, which means any tool that writes to S3 writes to MinIO without code changes. Both are mature, well-documented, and free to self-host. This combination covers the data layer for the vast majority of business automation use cases.
Deployment: Docker Compose, then Kubernetes if you must
Start with Docker Compose. A single docker-compose.yml file can define your n8n instance, Qdrant, PostgreSQL, MinIO, and Ollama together. One command brings the entire stack up. For most indie projects and early-stage startups, this is sufficient for months.
Kubernetes is the right answer when you need horizontal scaling, rolling deployments, or multi-region redundancy. It is not the right answer on day one. The operational complexity of a Kubernetes cluster is a real cost, even if the software is free.
The Provider Consolidation Lesson
We learned something counterintuitive building an early version of an autonomous outreach pipeline. The original architecture used three separate providers: one for research queries, one for lead scoring, one for writing. The per-operation cost was fractionally cheaper than using a single provider's full model lineup.
We scrapped it anyway.
Three API keys, three billing dashboards, three status pages to check when something breaks, three sets of rate limits to manage. The marginal cost savings did not survive contact with the operational reality of maintaining that many integrations. Every blueprint we build now runs on a single provider's lineup. One credential to configure, one bill to track, one status page to bookmark. The simplicity compounds over time in ways the cost calculation does not capture upfront.
The same principle applies to the open-source stack. The temptation is to pick the best tool for each layer independently: the fastest vector database, the most accurate embedding model, the most feature-rich orchestrator. Resist it. A coherent stack you understand deeply outperforms an optimal stack you are constantly debugging. This is especially true for teams without dedicated infrastructure engineers. For more on how architecture decisions affect operational overhead, our piece on AI back-office workflows versus hiring staff covers the tradeoff honestly.
When This Approach Breaks Down
The open-source self-hosted stack is not the right answer for every situation. Here is where it fails.
First, regulated industries. If you are processing healthcare records, financial data subject to SOC 2 audits, or anything under GDPR with strict data residency requirements, self-hosting is not automatically safer. It shifts the compliance burden entirely onto you. A managed cloud provider with existing certifications may be cheaper in total cost once legal review is factored in.
Second, teams without infrastructure experience. Running Ollama on a developer laptop is trivial. Running it reliably in production, with GPU acceleration, automatic restarts, load balancing across multiple instances, and proper logging, requires real systems knowledge. If your team's expertise is in product and application code, the hidden cost of learning infrastructure can exceed the API bills you were trying to avoid.
Third, frontier reasoning tasks. The gap between a locally-run open-source variant and a frontier reasoning engine narrows every quarter, but it has not closed. For tasks requiring multi-step logical deduction, nuanced judgment, or synthesis across long contexts, the best open-source options still trail the best proprietary ones. Know which category your use case falls into before committing to a stack.
Fourth, time-to-market pressure. A self-hosted stack takes days to configure correctly. A hosted API takes minutes. If you are validating a product hypothesis and need to move in hours, the managed API is the right call. Optimize infrastructure after you have confirmed the thing is worth building.
What We'd Do Differently
Start with the data layer, not the inference layer. Most teams spend their first week choosing between LLMs and their second week realizing their documents are in five different formats with inconsistent structure. The quality of your retrieval pipeline depends almost entirely on how clean and consistently chunked your source data is. We would spend the first sprint entirely on ingestion and normalization before touching a vector database or an LLM.
Build the monitoring layer before you need it. The open-source stack has no built-in observability. Langfuse is free to self-host and gives you trace-level visibility into every LLM call: latency, token counts, input/output pairs, and error rates. We have shipped stacks without it and regretted it every time something broke in production and we had no logs to diagnose from.
Treat provider consolidation as a first-class architectural constraint, not an afterthought. The multi-provider architecture we described earlier looked optimal on a spreadsheet. It was not optimal in practice. Before finalizing any stack, ask: how many credentials does a new team member need to configure to run this locally? If the answer is more than two, the architecture is more complex than it needs to be.
Top comments (0)