<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Becher Hilal</title>
    <description>The latest articles on DEV Community by Becher Hilal (@bash-thedev).</description>
    <link>https://dev.to/bash-thedev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855404%2Fabbef9d5-ab88-480c-bcf7-f632bd47c587.JPG</url>
      <title>DEV Community: Becher Hilal</title>
      <link>https://dev.to/bash-thedev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bash-thedev"/>
    <language>en</language>
    <item>
      <title>The Stack Nobody Recommended</title>
      <dc:creator>Becher Hilal</dc:creator>
      <pubDate>Sun, 05 Apr 2026 11:41:13 +0000</pubDate>
      <link>https://dev.to/bash-thedev/the-stack-nobody-recommended-3gja</link>
      <guid>https://dev.to/bash-thedev/the-stack-nobody-recommended-3gja</guid>
      <description>&lt;p&gt;The most common question I got after publishing &lt;a href="https://dev.to/bash-thedev/why-i-run-22-docker-services-at-home-23cj"&gt;Part 1&lt;/a&gt; was some variation of "why did you pick X instead of Y?" So this post is about that. Every major technology choice, what I actually considered, where I was right, and where I got lucky.&lt;/p&gt;

&lt;p&gt;I'll be upfront: some of these were informed decisions. Some were "I already know this tool, and I need to move fast." Both are valid, but they lead to different trade-offs down the line.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Backend: FastAPI
&lt;/h2&gt;

&lt;p&gt;I come from JavaScript and TypeScript. Years of React on the frontend, Express and Fastify on the backend. When I decided this project would be Python, because that's where the AI/ML ecosystem lives, I needed something that didn't feel foreign.&lt;/p&gt;

&lt;p&gt;FastAPI clicked immediately. The async/await model, the decorator-based routing, and type hints that actually do something. It felt like writing Fastify in Python. That familiarity wasn't the whole reason, but I'd be lying if I said it wasn't a factor.&lt;/p&gt;

&lt;p&gt;The technical reasons held up though. The system handles concurrent webhook callbacks from n8n, real-time polling from the React dashboard, and persistent asyncpg connections to PostgreSQL. All of that is async I/O, and FastAPI was built around that pattern. Django's async support exists now, but it still feels like it was added after the fact rather than designed in.&lt;/p&gt;

&lt;p&gt;I also deliberately avoided using an ORM. Every query in the system is hand-written SQL through asyncpg. With 95+ tables across 9 domains, I wanted to see exactly what was hitting the database. No magic, no N+1 surprises, no migration framework generating SQL I haven't read.&lt;/p&gt;
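The pattern looks like this in miniature: every query is a visible, hand-written string. Here sqlite3 stands in for asyncpg/PostgreSQL so the sketch is self-contained, and the table and query are illustrative, not the real schema:

```python
import sqlite3

# Every query is an explicit, hand-written SQL string. No ORM layer means
# you see exactly what hits the database: no generated SQL, no N+1 surprises.
GET_UNPROCESSED_EMAILS = """
    SELECT id, subject, sender
    FROM emails
    WHERE processed = 0
    ORDER BY received_at
    LIMIT ?
"""

def fetch_unprocessed(conn, limit=50):
    return conn.execute(GET_UNPROCESSED_EMAILS, (limit,)).fetchall()

# sqlite3 stands in for asyncpg/Postgres so this runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (id INTEGER PRIMARY KEY, subject TEXT,"
    " sender TEXT, processed INTEGER, received_at TEXT)"
)
conn.execute("INSERT INTO emails VALUES (1, 'Invoice', 'a@b.c', 0, '2026-04-01')")
rows = fetch_unprocessed(conn)
```

The trade is verbosity for transparency: a module of named query constants is longer than ORM calls, but every statement has been read by a human before it ever runs.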

&lt;p&gt;The price I paid for skipping Django? No free admin panel; I built a React dashboard from scratch, which took weeks. No built-in migration system; I manage schema changes with raw SQL files piped through SSH into Docker, which has bitten me more than once (shell quoting across SSH → Docker → psql mangles complex statements). And a thinner plugin ecosystem when I need something that Django has had for 20 years.&lt;/p&gt;

&lt;p&gt;If you're building a web app with user accounts, admin panels, and forms, just use Django. FastAPI makes sense when your backend is an API layer coordinating between services, which is my situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Database: PostgreSQL
&lt;/h2&gt;

&lt;p&gt;This wasn't a difficult decision. My data is deeply relational: transactions link to bank accounts, email classifications reference messages, knowledge facts get reinforced across multiple sources, and scheduler tasks reference agents that reference models. Trying to do this in MongoDB would mean denormalizing everything, embedding documents within documents, and handling consistency manually.&lt;/p&gt;

&lt;p&gt;But PostgreSQL gives me things beyond just relational storage that turned out to be critical.&lt;/p&gt;

&lt;p&gt;LISTEN/NOTIFY replaced what would normally require a message queue. When an email gets classified, a trigger fires a notification. The brain service catches it in milliseconds via asyncpg and reacts. No Kafka, no RabbitMQ; just a built-in feature that's been in Postgres for years:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;notify_email_classified&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;pg_notify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'email_classified'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json_build_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;'urgency'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;urgency&lt;/span&gt;
        &lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At my scale (maybe 50-100 events per hour), this is more than enough. Adding Kafka would mean another container, another config to maintain, and another thing that can go wrong at 3am. I'll add it when I actually need it.&lt;/p&gt;
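On the Python side, asyncpg delivers these notifications to a callback registered with `conn.add_listener("email_classified", ...)`. A runnable sketch of the handler, with the payload fields matching the trigger above; the routing rule is invented for illustration:

```python
import json

# asyncpg invokes the registered callback with the notification payload
# (its real signature also includes connection, pid, and channel).
# The JSON fields match what the pg_notify trigger above emits.
def on_notify(payload: str) -> str:
    event = json.loads(payload)
    # Hypothetical routing rule: escalate urgent emails, queue the rest.
    if event["urgency"] == "high":
        return f"escalate:{event['id']}"
    return f"queue:{event['id']}"

# Simulate what pg_notify would deliver over the channel:
result = on_notify('{"id": 42, "category": "billing", "urgency": "high"}')
```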

&lt;p&gt;CHECK constraints turned out to be one of the best decisions in the whole project. The database enforces what categories the AI is allowed to output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'billing'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shipping'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'subscription'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'employment'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'legal'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'marketing'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'personal'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'automated'&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLMs ignore your instructions sometimes. The extractor once invented a category that wasn't in the allowed list, and the INSERT failed. That's exactly what should happen: a loud failure is infinitely better than silently polluting your data with invalid categories.&lt;/p&gt;
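A runnable miniature of that failure mode, with sqlite3 standing in for Postgres (where asyncpg would raise a CheckViolationError) and a shortened category list:

```python
import sqlite3

# Same CHECK pattern as the Postgres schema above, shortened for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE email_classifications (
        id INTEGER PRIMARY KEY,
        category TEXT CHECK (category IN ('billing', 'shipping', 'personal'))
    )
""")

def insert_classification(category: str) -> bool:
    try:
        conn.execute(
            "INSERT INTO email_classifications (category) VALUES (?)",
            (category,),
        )
        return True
    except sqlite3.IntegrityError:
        # The loud failure surfaces here. Log it and alert;
        # never swallow it and store the row anyway.
        return False

ok = insert_classification("billing")
bad = insert_classification("llm-invented-category")  # rejected by the DB
```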

&lt;p&gt;I also use window functions and interval queries for rate limiting, cooldowns, and circuit breakers: all things you'd normally reach for Redis to do. One fewer container in the stack.&lt;/p&gt;
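A sketch of the cooldown pattern, again with sqlite3 standing in for Postgres (where the window clause would be `sent_at > now() - interval '1 hour'`); table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notifications (recipient TEXT, sent_at TEXT)")
# One recent send, one old one outside the window:
conn.execute("INSERT INTO notifications VALUES ('me', datetime('now', '-10 minutes'))")
conn.execute("INSERT INTO notifications VALUES ('me', datetime('now', '-2 hours'))")

def allowed(recipient: str, limit: int = 3) -> bool:
    # Count sends inside the last hour; in Postgres this would be an
    # interval comparison instead of sqlite's datetime() modifier.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM notifications "
        "WHERE recipient = ? AND sent_at > datetime('now', '-1 hour')",
        (recipient,),
    ).fetchone()
    return count < limit
```

The same shape handles circuit breakers: count recent failures in a window, and stop calling the downstream service once the count crosses a threshold.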

&lt;p&gt;Where MongoDB would win: truly document-shaped data with variable schemas. CMS content, user profiles with heterogeneous fields, and event logs with different payloads. My data isn't any of those things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow Engine: n8n
&lt;/h2&gt;

&lt;p&gt;This is the decision I have the most complicated feelings about.&lt;/p&gt;

&lt;p&gt;n8n is a self-hosted visual workflow editor. You wire together triggers, HTTP requests, database queries, and code nodes. For my email pipelines, being able to see the flow as a diagram is genuinely valuable. When something breaks, I can see exactly which step failed and what data it had.&lt;/p&gt;

&lt;p&gt;The self-hosting angle ruled out Zapier and Make immediately. My workflows process email bodies and financial data. That doesn't go through a third party. And n8n's code nodes let me drop JavaScript directly into a workflow step, which is how I build the complex JSON payloads for Ollama calls.&lt;/p&gt;

&lt;p&gt;But n8n has caused more production incidents than any other component in the system. Scheduled workflows overlap because n8n doesn't prevent concurrent executions by default; I had to build a database-level guard to check whether a previous run was still in progress. The API silently truncates long SQL queries without any error. Code nodes run in a sandboxed V8 isolate where &lt;code&gt;process.env&lt;/code&gt; doesn't exist (you need &lt;code&gt;$env&lt;/code&gt; instead), and building JSON in HTTP Request expressions is fragile enough that complex payloads should always go through a Code node first.&lt;/p&gt;

&lt;p&gt;None of these are dealbreakers individually. But collectively, n8n demands a level of defensive programming that I didn't expect from a workflow tool. Every workflow that involves an LLM call now has a stacking check, every SQL query gets verified after deployment, and I've learned to build payloads in Code nodes instead of expression fields.&lt;/p&gt;
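The database-level guard is simple in miniature. sqlite3 stands in for Postgres here, the table name is illustrative, and a production version should make the check-and-insert atomic (an advisory lock or unique partial index), which this sketch skips:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workflow_runs (workflow TEXT, status TEXT)")

def try_start(workflow: str) -> bool:
    # Bail out if a previous run of this workflow is still in progress.
    (running,) = conn.execute(
        "SELECT COUNT(*) FROM workflow_runs"
        " WHERE workflow = ? AND status = 'running'",
        (workflow,),
    ).fetchone()
    if running:
        return False  # stacking detected: skip this scheduled tick
    conn.execute("INSERT INTO workflow_runs VALUES (?, 'running')", (workflow,))
    return True

def finish(workflow: str):
    conn.execute(
        "UPDATE workflow_runs SET status = 'done' WHERE workflow = ?",
        (workflow,),
    )

first = try_start("email-triage")
second = try_start("email-triage")  # overlapping tick gets refused
finish("email-triage")
third = try_start("email-triage")
```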

&lt;p&gt;If your workflows are mostly code with minimal visual benefit, write Python scripts with a scheduler. The visual editor is n8n's actual advantage. If you don't need it, you're adding complexity for nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local LLM Serving: Ollama
&lt;/h2&gt;

&lt;p&gt;Ollama won on simplicity and nothing else. Install it, &lt;code&gt;ollama pull qwen3:14b&lt;/code&gt;, and there's a model serving API on &lt;code&gt;localhost:11434&lt;/code&gt;. No CUDA configuration, no Python environment management, no Docker GPU passthrough headaches.&lt;/p&gt;

&lt;p&gt;Switching between models is changing one string in the request payload. The API is consistent across every model (&lt;code&gt;/api/chat&lt;/code&gt;, &lt;code&gt;/api/generate&lt;/code&gt;, &lt;code&gt;/api/embed&lt;/code&gt;), which makes the routing logic in my system trivial.&lt;/p&gt;
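A minimal sketch of a call: the endpoint and payload shape follow Ollama's `/api/chat` API, while the helper names and prompt are illustrative. The network call itself obviously needs a running Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port

def build_chat_request(model: str, prompt: str) -> dict:
    # Switching models really is one string in the payload; the request
    # shape is identical for every model Ollama serves.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str) -> str:
    # Not executed here: requires a live Ollama instance.
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

payload = build_chat_request("qwen3:14b", "Classify this email: ...")
```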

&lt;p&gt;What I gave up: vLLM offers tensor parallelism, continuous batching, and quantization control that Ollama hides behind its abstraction. For a platform serving many concurrent users, vLLM is the right choice. For a single-user system running one model at a time on a Mac mini, Ollama's defaults are fine, and the setup time difference is measured in hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Communication: Mattermost (For Now)
&lt;/h2&gt;

&lt;p&gt;I need human-in-the-loop approval for every consequential action. The system posts to a chat with context and Approve/Reject buttons. I click, a webhook fires, and the workflow continues.&lt;/p&gt;

&lt;p&gt;I picked Mattermost because it's open source, self-hosted, and has interactive message attachments. That was the full evaluation. It wasn't strategic. It was "this runs in Docker and has buttons."&lt;/p&gt;

&lt;p&gt;It works. But I'm planning to migrate to Rocket.Chat. I want voice interaction with the assistant eventually, and Mattermost's audio calling is limited. Rocket.Chat also has more mature mobile apps, which matters because the whole point of HITL is approving actions when I'm not at my desk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Networking: Tailscale
&lt;/h2&gt;

&lt;p&gt;Three machines need to talk to each other. Tailscale gives each one a stable IP that works regardless of the physical network. No port forwarding, no dynamic DNS, no opening ports on my router. Setup took about 10 minutes.&lt;/p&gt;

&lt;p&gt;I could have configured WireGuard manually for the same encryption and performance, but then I'd be managing key rotation, endpoint configs, and NAT traversal myself. For a three-node network, Tailscale's convenience is worth it.&lt;/p&gt;

&lt;p&gt;One thing people ask: why not Cloudflare Tunnels? Because they solve a different problem. Cloudflare Tunnels expose services to the internet through Cloudflare's network. My services don't need to be on the internet; they need to talk to each other privately. Mesh VPN, not reverse proxy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Search: Elasticsearch (Added Later)
&lt;/h2&gt;

&lt;p&gt;I didn't start with Elasticsearch. I started with ChromaDB because it's lighter, runs in Docker, has a simple Python API, and is good enough for basic vector search.&lt;/p&gt;

&lt;p&gt;The problem showed up when the knowledge base grew. I had thousands of facts, entities, and patterns, and I needed to search by meaning &lt;em&gt;and&lt;/em&gt; by exact keywords in the same query. ChromaDB handles vectors. PostgreSQL handles keywords. But running two searches across two systems and merging results is fragile and slow.&lt;/p&gt;

&lt;p&gt;Elasticsearch does both natively (BM25 for exact keyword matching, kNN for vector similarity) in a single query. That's what made me migrate. The trade-off is 4GB of heap memory on a machine that was already tight. For smaller datasets or pure vector search, ChromaDB or pgvector are lighter options.&lt;/p&gt;
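The shape of such a hybrid request against Elasticsearch 8.x, with BM25 and kNN in one body; the index fields are illustrative, and the embedding vector would come from Ollama's `/api/embed`:

```python
# Builds the body for a single Elasticsearch 8.x _search request that
# combines BM25 keyword matching with kNN vector similarity.
# Field names (content, content_vector) are illustrative, not my schema.
def hybrid_search_body(text: str, embedding: list[float], k: int = 10) -> dict:
    return {
        "query": {                      # BM25 over the raw text
            "match": {"content": text}
        },
        "knn": {                        # vector similarity, same request
            "field": "content_vector",
            "query_vector": embedding,
            "k": k,
            "num_candidates": 5 * k,    # candidate pool per shard
        },
        "size": k,
    }

body = hybrid_search_body("dutch vat deadline", [0.1] * 768)
```

Elasticsearch combines the two score sets itself, which is exactly the merge step that was fragile when ChromaDB and PostgreSQL each held half the answer.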

&lt;p&gt;I'll cover the migration in a dedicated post.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;Deployment. Right now, I deploy by SSH-ing into a Windows host and running PowerShell commands. No CI/CD, no GitHub Actions. It works because I'm the only developer, but it's the first thing that would break if anyone else needed to contribute.&lt;/p&gt;

&lt;p&gt;If I started over, Linux from day one and a basic GitHub Actions pipeline; push to main, build container, deploy. Not Kubernetes, not Terraform. Just automating the 90-second script I currently run manually.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of "&lt;a href="https://dev.to/bash-thedev/series/38113"&gt;One Developer, 22 Containers&lt;/a&gt;". Next up: migrating from ChromaDB to Elasticsearch, and why hybrid search changed how my AI system finds information.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you've made similar choices — or different ones — I'd love to hear about it in the comments. Find me on &lt;a href="https://github.com/The-Bash" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>fastapi</category>
      <category>postgres</category>
      <category>docker</category>
    </item>
    <item>
      <title>Why I Run 22 Docker Services at Home</title>
      <dc:creator>Becher Hilal</dc:creator>
      <pubDate>Sat, 04 Apr 2026 21:39:46 +0000</pubDate>
      <link>https://dev.to/bash-thedev/why-i-run-22-docker-services-at-home-23cj</link>
      <guid>https://dev.to/bash-thedev/why-i-run-22-docker-services-at-home-23cj</guid>
      <description>&lt;p&gt;Somewhere in my living room, a 2018 gaming PC is running 22 Docker containers, processing 15,000 emails through a local LLM, and managing the finances of a real business. It was never supposed to do any of this.&lt;/p&gt;

&lt;p&gt;I run a one-person software consultancy in the Netherlands: web development, 3D printing, and consulting. Last year, I started building an AI system to help me manage it all. Eight specialized agents handling email triage, financial tracking, infrastructure monitoring, and scheduling. Every piece of inference runs locally. No cloud APIs touching my private data.&lt;/p&gt;

&lt;p&gt;This post covers the hardware, what it actually costs, and what I'd do differently if I started over.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Three Machines, One Mesh Network
&lt;/h2&gt;

&lt;p&gt;The entire system runs on three machines connected via &lt;a href="https://tailscale.com/" rel="noopener noreferrer"&gt;Tailscale&lt;/a&gt; mesh VPN:&lt;/p&gt;

&lt;h4&gt;
  
  
  docker-host
&lt;/h4&gt;

&lt;p&gt;A PC I assembled from leftover parts. Over the years, as I upgraded my main gaming machine, the old CPUs, RAM sticks, and motherboards piled up. Eventually, I had enough to build a second computer. It now runs 22+ Docker containers 24/7.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: AMD Ryzen 5 2600X (6 cores, 12 threads)&lt;/li&gt;
&lt;li&gt;RAM: 32GB DDR4 (two 16GB kits — more on this later)&lt;/li&gt;
&lt;li&gt;GPU: NVIDIA GTX 1060 3GB — useless for inference (3GB VRAM), but the Ryzen 5 2600X has no integrated graphics. Without this card, there's no display output. It exists purely to give the machine a screen.&lt;/li&gt;
&lt;li&gt;OS: Windows 11 with Docker Desktop — I still use this machine as a Windows PC occasionally, which is the honest reason it hasn't been wiped to Linux yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  inference
&lt;/h4&gt;

&lt;p&gt;A Mac mini M4, bought specifically for local LLM inference.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chip: Apple M4, 10-core CPU, 10-core GPU&lt;/li&gt;
&lt;li&gt;RAM: 24GB unified memory (~17GB available for models after OS and services)&lt;/li&gt;
&lt;li&gt;Role: Ollama model serving, plus Proton Mail Bridge (which requires a GUI; no headless mode exists)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  edge-vps
&lt;/h4&gt;

&lt;p&gt;A Hostinger VPS, ~€5/month. Runs Nginx Proxy Manager and Uptime Kuma. Exists for one reason: if my home network dies, this is the canary that tells me about it. You can't monitor your own availability from inside the thing that's failing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Local-First: It Started With the Subscriptions
&lt;/h2&gt;

&lt;p&gt;Before I built any of this, I was paying for Claude Pro, GPT Pro, Perplexity Pro, and Google AI. Four separate subscriptions. Each gave me partial access to models through their own interfaces, each with its own limitations on what I could integrate, and each getting a copy of whatever I fed into it.&lt;/p&gt;

&lt;p&gt;My system handles emails, bank transactions, client contracts, delivery tracking, and tax preparation (basically the complete operational picture of my business) in one database. That's the kind of data I don't want leaving my network.&lt;/p&gt;

&lt;p&gt;It's not that I think cloud providers are malicious. It's that I don't want to be in a position where I have to &lt;em&gt;trust&lt;/em&gt; their data handling with everything my business runs on. So the guardrails are explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cloud_llm_boundary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hard_rule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO cloud LLM usage by any agent without explicit human permission."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"prohibited_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Email content — body, subject, sender, recipient, attachments"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Financial data — transactions, invoices, account numbers, balances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Client information — names, contacts, project details, contracts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Personal data — addresses, phone numbers, government identifiers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Infrastructure — credentials, API keys, internal hostnames, IPs"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"exceptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Development and debugging only, never with production data."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every piece of production inference runs on Ollama on the Mac mini. Zero tokens leave the house for private data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Math
&lt;/h2&gt;

&lt;p&gt;This is the part that convinced me the approach was sustainable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Local Cost&lt;/th&gt;
&lt;th&gt;Cloud Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM inference&lt;/td&gt;
&lt;td&gt;€0 (electricity)&lt;/td&gt;
&lt;td&gt;€100-500/mo (API usage at similar volume)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;€0 (Docker)&lt;/td&gt;
&lt;td&gt;€50-200/mo (managed Postgres)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;€0 (Docker)&lt;/td&gt;
&lt;td&gt;€100-300/mo (Elastic Cloud)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;n8n&lt;/td&gt;
&lt;td&gt;€0 (self-hosted)&lt;/td&gt;
&lt;td&gt;€24-200/mo (n8n cloud)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mattermost&lt;/td&gt;
&lt;td&gt;€0 (self-hosted)&lt;/td&gt;
&lt;td&gt;€0-50/mo (limited free tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;€5/mo (VPS)&lt;/td&gt;
&lt;td&gt;€20-50/mo (Datadog, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~€5/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;€300-1,300/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But electricity is real. The Ryzen 5 2600X idles around 65W, the Mac mini M4 around 5-7W (rising to ~30W during inference). With load spikes, the router, and peripherals, call it 140-165W average for the whole setup; at Dutch electricity prices (~€0.25/kWh), that's 100-120 kWh a month, or about &lt;strong&gt;€25-30/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real total: ~€35/month&lt;/strong&gt; versus a minimum of €300/month in cloud services. And I went from four AI subscriptions down to one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually Running: 22 Containers
&lt;/h2&gt;

&lt;p&gt;The VPS runs separately with just 2 containers: Nginx Proxy Manager for webhook ingress and Uptime Kuma for external monitoring. Everything else is on docker-host.&lt;/p&gt;

&lt;h3&gt;
  
  
  The RAM Reality
&lt;/h3&gt;

&lt;p&gt;Here's the part nobody shows you in tutorials: how 32GB actually gets divided:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Windows 11 OS overhead:              ~4 GB
Elasticsearch (Java heap):            4 GB  (-Xms4g -Xmx4g)
n8n (Node.js):                      ~4-6 GB typical usage
PostgreSQL:                          ~1 GB
Mattermost:                         ~0.5 GB
7x Python services:                  ~2 GB total
Other containers:                    ~1 GB
Docker engine overhead:              ~1 GB
─────────────────────────────────────────
Total:                              ~18-20 GB typical, ~30 GB under load
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The n8n allocation deserves explanation. It's configured with &lt;code&gt;NODE_OPTIONS=--max-old-space-size=16384&lt;/code&gt;, a 16GB ceiling. That sounds aggressive, but without it, Node.js defaults to a much lower heap limit. When a workflow processes a batch of large email bodies through an LLM and the responses come back as multi-kilobyte JSON objects, memory spikes fast. If the heap limit is too low, Node's garbage collector starts running constantly, trying to free memory instead of doing actual work. Eventually, the process crashes with an out-of-memory error. The high ceiling gives it room to breathe. In practice, n8n uses 4-6GB.&lt;/p&gt;

&lt;p&gt;The real constraint isn't peak usage; it's that everything competes for the same memory bus. When Elasticsearch is indexing, n8n is running 16 workflows, and PostgreSQL is handling a complex CTE query simultaneously... things slow down. Nothing crashes; it just slows down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ollama on the Mac Mini: The Inference Layer
&lt;/h2&gt;

&lt;p&gt;The M4's unified memory architecture is genuinely excellent for LLM inference. Unlike discrete GPUs, where you're limited by VRAM (my GTX 1060's 3GB is useless for anything beyond tiny models), the M4 can use its full 24GB for model weights. The memory bandwidth (120 GB/s) is lower than a high-end GPU, but for a 14B parameter model, it's more than enough.&lt;/p&gt;

&lt;p&gt;I run a tiered model strategy:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;qwen2.5:14b&lt;/td&gt;
&lt;td&gt;~9 GB&lt;/td&gt;
&lt;td&gt;Email triage, transaction categorization&lt;/td&gt;
&lt;td&gt;1-2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;qwen3:14b&lt;/td&gt;
&lt;td&gt;~9.3 GB&lt;/td&gt;
&lt;td&gt;Judgment calls, tool use, knowledge extraction&lt;/td&gt;
&lt;td&gt;1.5-3s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two more tiers are planned but not yet running locally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generation (qwen3:32b)&lt;/strong&gt; — for client-facing content where quality matters. Needs a GPU with more VRAM than what I currently have.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision (llama3.2-vision:11b)&lt;/strong&gt; — for screenshot comparison and 3D print quality inspection. Planned for when the system matures enough to need it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With ~17GB available for models, I can only run one at a time. The keep-alive is set to 10 seconds; models unload quickly to free RAM for the next one. The flow looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Classification batch starts → qwen2.5:14b loads (~4 second cold start)&lt;/li&gt;
&lt;li&gt;Processes 10-50 emails → model stays warm&lt;/li&gt;
&lt;li&gt;Batch finishes → 10 seconds idle → model unloads&lt;/li&gt;
&lt;li&gt;Brain needs to reason → qwen3:14b loads&lt;/li&gt;
&lt;li&gt;Brain finishes → unloads&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works because classification and reasoning don't overlap much. The classifier runs on a schedule; the brain runs on events. The 4-second cold start is acceptable. If I had 48GB of unified memory, I'd keep both warm permanently, but the M4 with 24GB was the sweet spot for price/performance.&lt;/p&gt;
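The tiering plus the 10-second keep-alive reduces to a request builder. The model names match the table above, and `keep_alive` is Ollama's real per-request parameter; the tier routing itself is a simplified stand-in:

```python
# Tier-to-model routing; one model loaded at a time on the Mac mini.
MODEL_TIERS = {
    "classification": "qwen2.5:14b",
    "reasoning": "qwen3:14b",
}

def build_request(task: str, prompt: str) -> dict:
    return {
        "model": MODEL_TIERS[task],
        "prompt": prompt,
        "stream": False,
        # Ollama's keep_alive controls how long the model stays resident
        # after the last request; "10s" unloads it quickly to free the
        # ~9 GB for whichever model the next task needs.
        "keep_alive": "10s",
    }

req = build_request("classification", "Categorize: 'Your invoice is attached'")
```

The same knob can also be set globally via the `OLLAMA_KEEP_ALIVE` environment variable instead of per request.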

&lt;h3&gt;
  
  
  The Logging Proxy
&lt;/h3&gt;

&lt;p&gt;One of the more useful things I built is an HTTP proxy that sits between all consumers and Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Proxy sits between all services and Ollama
# Every inference call gets logged to PostgreSQL
&lt;/span&gt;&lt;span class="n"&gt;INFERENCE_ENDPOINTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/embed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;POLL_ENDPOINTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/ps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every inference request gets logged with full token counts, latency, and caller info. The logging happens in a daemon thread, so it doesn't block the response. This means I can query the usage table to see exactly which service is consuming the most tokens, what the average latency is, and which workflows are the heaviest users.&lt;/p&gt;
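The fire-and-forget pattern in miniature, using only the standard library; the list append stands in for the real PostgreSQL INSERT:

```python
import queue
import threading

# The request path drops a record on a queue and returns immediately;
# a daemon thread drains the queue into storage in the background.
log_queue: "queue.Queue[dict]" = queue.Queue()
written = []  # stand-in for the PostgreSQL usage table

def writer():
    while True:
        record = log_queue.get()
        written.append(record)  # real code: INSERT into the usage table
        log_queue.task_done()

threading.Thread(target=writer, daemon=True).start()

def log_inference(caller: str, model: str, tokens: int, latency_ms: float):
    # Called from the proxy's request path; never blocks the response.
    log_queue.put({"caller": caller, "model": model,
                   "tokens": tokens, "latency_ms": latency_ms})

log_inference("email-classifier", "qwen2.5:14b", 512, 1840.0)
log_queue.join()  # demo only: wait so the record is visibly written
```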

&lt;p&gt;All containers talk to the proxy. They never hit Ollama directly. This gives me a single point of observability for all LLM traffic across the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Machines Find Each Other
&lt;/h2&gt;

&lt;p&gt;Tailscale gives each machine a stable IP that works regardless of the physical network. No port forwarding. No dynamic DNS. No opening ports on the home router.&lt;/p&gt;

&lt;p&gt;Docker containers on the docker-host reach the inference server's Ollama through the Tailscale IP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml (simplified, IPs redacted)&lt;/span&gt;
&lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;OLLAMA_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://&amp;lt;inference-tailscale-ip&amp;gt;:11433&lt;/span&gt;
      &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql://user:pass@postgres:5432/db&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Services on the same Docker host use Docker service names (e.g., &lt;code&gt;http://postgres:5432&lt;/code&gt;). Cross-machine communication goes through Tailscale IPs.&lt;/p&gt;

&lt;p&gt;I also run CoreDNS inside Docker for internal subdomain routing: friendly names like &lt;code&gt;dashboard.internal&lt;/code&gt; and &lt;code&gt;api.internal&lt;/code&gt;, all resolving to Tailscale IPs within the mesh only. One thing worth knowing if you set this up: CoreDNS in authoritative mode doesn't fall through to external DNS for missing records; it returns NXDOMAIN. So every new internal subdomain needs an entry in the zone file, or it simply won't resolve.&lt;/p&gt;
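&lt;p&gt;A minimal sketch of what that looks like (paths, names, and the IP are placeholders, and the zone excerpt omits the SOA/NS boilerplate a real zone needs): a Corefile that is authoritative for the &lt;code&gt;internal&lt;/code&gt; zone and forwards everything else, plus the zone file every new subdomain has to be added to.&lt;/p&gt;

```
# Corefile (sketch): authoritative for the internal zone, forward the rest
internal:53 {
    file /etc/coredns/internal.db
    log
}
.:53 {
    forward . 1.1.1.1
}

; internal.db (zone file excerpt): one A record per subdomain.
; Anything missing here returns NXDOMAIN -- it never falls through
; to the forward block above.
dashboard.internal.  IN  A  100.101.102.103
api.internal.        IN  A  100.101.102.103
```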

&lt;h2&gt;
  
  
  The Memory Mystery
&lt;/h2&gt;

&lt;p&gt;The 32GB of DDR4 in docker-host is two 16GB kits of Corsair Vengeance RGB Pro, rated at 3200MHz. Same model number, same batch number; one kit bought in 2018, one in 2022. They should be as compatible as two kits can physically be.&lt;/p&gt;

&lt;p&gt;They aren't. I've set XMP to 3200MHz multiple times. With the original single kit, I even ran a stable overclock at 3600MHz. But since adding the second kit, the profile either fails to apply or reverts to the JEDEC default of 2133MHz after some time. No error, no BSOD; it just silently drops back.&lt;/p&gt;

&lt;p&gt;So right now, 32GB of 3200MHz-rated memory is running at 2133MHz. That's roughly 33% of the memory bandwidth sitting unused. Every container, every query, every Docker layer pull. All running at two-thirds speed on the memory bus.&lt;/p&gt;
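&lt;p&gt;The arithmetic behind that figure, assuming dual-channel operation and DDR4's 64-bit (8-byte) bus per channel:&lt;/p&gt;

```python
# Theoretical peak bandwidth: transfers/s x bytes per transfer x channels.
def peak_bandwidth_gbs(mt_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

rated = peak_bandwidth_gbs(3200)   # 51.2 GB/s at the XMP profile
actual = peak_bandwidth_gbs(2133)  # ~34.1 GB/s at JEDEC 2133
loss = 1 - actual / rated          # ~0.33, the "roughly 33%" above
print(rated, actual, round(loss, 3))
```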

&lt;p&gt;I haven't fully diagnosed whether it's a subtle timing incompatibility between the kits, a motherboard limitation with four DIMMs populated, or something else entirely. It's on the list, but it's the kind of issue that requires dedicated downtime to troubleshoot properly, and downtime means taking 22 containers offline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Change If I Started Over
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Linux instead of Windows on docker-host.&lt;/strong&gt; Docker on Windows works, but it adds friction everywhere. My deploy script runs PowerShell commands over SSH (&lt;code&gt;Remove-Item -Recurse -Force&lt;/code&gt; instead of &lt;code&gt;rm -rf&lt;/code&gt;). I once corrupted a CoreDNS zone file because PowerShell's &lt;code&gt;-replace&lt;/code&gt; treats &lt;code&gt;\n&lt;/code&gt; as literal text instead of a newline. Linux would eliminate an entire category of issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A dedicated, purpose-built server.&lt;/strong&gt; The current machine has three problems: it's not built for this job, it's not efficient at this job, and it has competing use cases.&lt;/p&gt;

&lt;p&gt;The docker-host is also my occasional Windows machine (I still use it for things that need Windows). That means I can't wipe it to Linux, and it means the machine is pulling double duty when it should be dedicated infrastructure. In an ideal setup, Docker lives on its own box that I never touch except to SSH into.&lt;/p&gt;

&lt;p&gt;The hardware itself is wasteful for a container host. The Ryzen 5 2600X pulls 95W TDP. Those 12 threads are genuinely useful when n8n, PostgreSQL, and Elasticsearch all spike at once, but most of the time, containers are waiting on I/O, not burning CPU. An Intel i5-12500T at 35W would handle the same workload. Then there's the GTX 1060 drawing 120W under load for absolutely nothing; it's only installed because the Ryzen has no integrated graphics. And the 650W PSU is running at maybe 20% load, which is the least efficient part of its power curve. The whole machine is basically optimized for gaming, not for sitting in a corner running Docker.&lt;/p&gt;

&lt;p&gt;My ideal replacement: something like a &lt;strong&gt;Dell OptiPlex 3080 Micro&lt;/strong&gt; — small form factor, Intel with integrated graphics (no discrete GPU needed), 16GB RAM (expandable), designed for 24/7 operation, near-silent. These go for reasonable prices secondhand, though RAM pricing makes anything above 16GB expensive. It wouldn't match the Ryzen's raw multi-threaded output, but for a Docker host that's mostly waiting on I/O and network, it doesn't need to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;48GB on the Mac mini.&lt;/strong&gt; The 24GB M4 is good, but being limited to one model at a time creates a scheduling bottleneck. With 48GB I could keep the classifier and the reasoning model warm simultaneously and cut out the cold-start latency entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with Elasticsearch earlier.&lt;/strong&gt; I started with ChromaDB for vector search because it's lighter. But once I needed hybrid search (keyword + semantic in the same query), I had to migrate anyway. If your data has both structured metadata and unstructured text (and you know you'll need to search both), start with something that handles both natively. That said, if you only need vector similarity for a smaller dataset, ChromaDB or &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector&lt;/a&gt; will save you 2GB of RAM and a lot of query DSL.&lt;/p&gt;
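&lt;p&gt;To make "hybrid" concrete, here is the shape of request I mean: one Elasticsearch search body combining a BM25 &lt;code&gt;match&lt;/code&gt; clause with a &lt;code&gt;knn&lt;/code&gt; clause over a &lt;code&gt;dense_vector&lt;/code&gt; field (supported as a single request in Elasticsearch 8.x, which sums the two scores by default). Field names and the embedding size are illustrative, not my actual schema.&lt;/p&gt;

```python
# Build a hybrid (keyword + semantic) Elasticsearch query body.
# Field/index names and the 768-dim embedding are illustrative.
def hybrid_query(text: str, embedding: list, k: int = 10) -> dict:
    return {
        "query": {  # keyword leg: classic BM25 relevance
            "match": {"content": {"query": text}}
        },
        "knn": {    # semantic leg: ANN over precomputed embeddings
            "field": "content_vector",
            "query_vector": embedding,
            "k": k,
            "num_candidates": k * 10,
        },
        "size": k,
    }

body = hybrid_query("quarterly invoice totals", [0.1] * 768)
# Sent as POST /documents/_search with this JSON body.
```

&lt;p&gt;Getting the keyword leg out of a pure vector store would mean a second system and a manual score merge; here it's one round trip.&lt;/p&gt;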

&lt;h2&gt;
  
  
  The Control Argument
&lt;/h2&gt;

&lt;p&gt;Beyond cost and privacy, there's a third reason I run local-first: &lt;strong&gt;I own the upgrade timeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I decide when to update Postgres. When Elasticsearch changes licensing, it doesn't affect my running instance. When n8n raises cloud pricing, it doesn't matter. When a model provider deprecates an API version, my workflows keep running.&lt;/p&gt;

&lt;p&gt;I've been bitten by the alternative. I originally planned to use a specific open banking provider for transaction imports. They closed to new signups months after I started planning around them. Because my architecture is local-first, switching to a different provider was a contained change, one API integration, not a full re-architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is This For You?
&lt;/h2&gt;

&lt;p&gt;Honest answer: probably not, if you're building a side project or a startup MVP. The setup cost in time is real. Docker Compose files don't write themselves, Tailscale needs configuring, and you'll spend a weekend debugging why a Python service can't reach Elasticsearch through Docker's bridge network.&lt;/p&gt;

&lt;p&gt;If your data is genuinely sensitive and you have ongoing infrastructure needs, and you don't mind being your own sysadmin, it's worth considering. If you need to scale past what consumer hardware handles, or you have a team that needs managed infrastructure, or you'd rather write application code than debug Docker networking at midnight, stick with cloud services. There's no shame in that — it's a legitimate trade-off.&lt;/p&gt;

&lt;p&gt;For me, €35/month, zero data leaving the house, and full control over every component is worth being my own sysadmin, DBA, and on-call engineer. For a solo operation, that math works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of "&lt;a href="https://dev.to/bash-thedev/series/38113"&gt;One Developer, 22 Containers&lt;/a&gt;", a series about building an AI office management system on consumer hardware. &lt;a href="https://dev.to/bash-thedev/why-i-run-22-docker-services-at-home-23cj"&gt;Part 1&lt;/a&gt; covered why I run 22 Docker services at home; this post covered the technology decisions behind them.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you're building something similar or have questions about any of the stack, I'd love to hear about it in the comments. You can also find me on &lt;a href="https://github.com/The-Bash" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>selfhosted</category>
      <category>ai</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
