<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Becher Hilal</title>
    <description>The latest articles on DEV Community by Becher Hilal (@bash-thedev).</description>
    <link>https://dev.to/bash-thedev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855404%2Fabbef9d5-ab88-480c-bcf7-f632bd47c587.JPG</url>
      <title>DEV Community: Becher Hilal</title>
      <link>https://dev.to/bash-thedev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bash-thedev"/>
    <language>en</language>
    <item>
      <title>Why I Let a Machine Judge My Code</title>
      <dc:creator>Becher Hilal</dc:creator>
      <pubDate>Wed, 15 Apr 2026 13:07:57 +0000</pubDate>
      <link>https://dev.to/bash-thedev/why-i-let-a-machine-judge-my-code-42ca</link>
      <guid>https://dev.to/bash-thedev/why-i-let-a-machine-judge-my-code-42ca</guid>
      <description>&lt;p&gt;There's a moment in every growing codebase where you realize you can no longer hold all of it in your head. For me, it was when I opened a file I'd written three weeks earlier and couldn't immediately follow the control flow. Not because the code was wrong. Because it had grown past the point where any single person could casually review it all with the same attention.&lt;/p&gt;

&lt;p&gt;I run 5 Python services, a React frontend, and a growing collection of utility scripts. No team. No pull request reviewers besides myself. The codebase grows weekly. And I decided that trying to maintain consistency through discipline alone is a losing strategy at this scale. The machine needs to enforce the standards I care about, so my attention goes to the decisions that actually need a human brain.&lt;/p&gt;

&lt;h2&gt;The Pipeline&lt;/h2&gt;

&lt;p&gt;Three layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Guidelines file    →    Linter config    →    Pre-commit hook
(sets expectations)     (checks the code)     (blocks the commit)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The guidelines file defines the standards: naming conventions, SQL patterns, pitfalls specific to this codebase. The linter config translates those standards into automated checks. The pre-commit hook makes them non-negotiable. Code that doesn't pass doesn't enter the repository. Not warned. Blocked.&lt;/p&gt;

&lt;p&gt;This isn't about catching catastrophic bugs. It's about preventing the slow accumulation of inconsistency that turns a clean codebase into one you dread opening. Individually, an unused import is harmless. A hundred of them spread across fifty files is how a project starts to feel unmaintained.&lt;/p&gt;

&lt;h2&gt;Documentation vs Enforcement&lt;/h2&gt;

&lt;p&gt;There's an interesting insight from the recent Claude Code source leak. Anthropic's internal tooling moved rules out of documentation files and into enforced hooks. Their reasoning: documentation competes for attention. The more guidelines you write, the more they blend into the background. Hooks are executed by the system. They don't care whether anyone read the documentation.&lt;/p&gt;

&lt;p&gt;My experience matches. The guidelines file says "use &lt;code&gt;$1, $2, $3&lt;/code&gt; for SQL placeholders." That convention gets followed most of the time. The linter catches it every time. "Most" vs "every" is the difference between a suggestion and an enforcement. Both have a place, but I know which one I trust when I'm not actively watching.&lt;/p&gt;

&lt;h2&gt;The Interesting Rules&lt;/h2&gt;

&lt;p&gt;I run 23 Ruff rule prefixes. The basics (formatting, import ordering, dead imports, stray &lt;code&gt;print()&lt;/code&gt; calls) are table stakes. They keep the codebase clean but they're not worth writing about individually. Here's what actually gets interesting.&lt;/p&gt;
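&lt;p&gt;As a rough sketch (the exact list lives in my config; this shows only the rule families this post actually mentions, not all 23 prefixes):&lt;/p&gt;

```toml
[tool.ruff.lint]
# Illustrative subset of the selection, not the full list
select = [
    "E", "W",    # pycodestyle: formatting basics
    "I",         # isort: import ordering
    "F",         # pyflakes: dead imports, unused names
    "T20",       # flake8-print: stray print() calls
    "S",         # flake8-bandit: security checks, incl. SQL injection
    "C90",       # mccabe: complexity scoring
    "ASYNC",     # flake8-async: async pitfalls
    "DTZ",       # flake8-datetimez: naive datetime usage
]
```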

&lt;h3&gt;SQL Injection Detection&lt;/h3&gt;

&lt;p&gt;My API layer uses raw asyncpg with parameterized queries. No ORM. That means every database query is a hand-written SQL string. The safe way looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM transactions WHERE category = $1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;$1&lt;/code&gt; is a placeholder. The database receives the query and the variable separately, so it always treats &lt;code&gt;category&lt;/code&gt; as data, never as executable SQL. Even if someone passes &lt;code&gt;'; DROP TABLE transactions; --&lt;/code&gt; as the value, nothing bad happens. It's just a weird category name that matches zero rows.&lt;/p&gt;

&lt;p&gt;The dangerous version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM transactions WHERE category = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pastes the variable directly into the SQL string. If the input is malicious, the database executes it as code. Classic SQL injection. The linter flags this pattern and blocks the commit.&lt;/p&gt;

&lt;p&gt;The complication: because I write raw SQL strings (not an ORM building them), the linter also flags my &lt;em&gt;safe&lt;/em&gt; queries as "hardcoded SQL." I have to explicitly tell it that SQL strings are intentional in this codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;ignore&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"S608"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c"&gt;# asyncpg uses $1 params, raw SQL strings are intentional&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every ignore has a documented reason. When I'm reviewing this config months from now, I need to know whether each exception was a deliberate decision or a shortcut.&lt;/p&gt;

&lt;h3&gt;Complexity Scoring&lt;/h3&gt;

&lt;p&gt;McCabe complexity, max of 10. Roughly 10 independent code paths through a function. When I introduced the linter to the existing codebase, some older functions exceeded this threshold. That's normal when you retrofit standards onto code that was already running. The policy: existing code that exceeds the limit gets an exemption. New code doesn't. The exemption list is meant to shrink over time as I refactor, not grow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.ruff.lint.mccabe]&lt;/span&gt;
&lt;span class="py"&gt;max-complexity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complexity check is the one that's saved me the most time long-term. A function with a score of 12+ is a function where fixing a bug in one branch risks breaking three others. Catching that before it lands means refactoring when the context is fresh, not six months later when I've forgotten why the branches exist.&lt;/p&gt;
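&lt;p&gt;The exemption mechanism is Ruff's per-file ignores, scoped to the McCabe rule code. A sketch (the file path here is made up for illustration):&lt;/p&gt;

```toml
[tool.ruff.lint.per-file-ignores]
# Grandfathered: predates the complexity limit. Delete the entry once refactored.
"services/legacy_reports.py" = ["C901"]
```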

&lt;h3&gt;The Async Concurrency Trap&lt;/h3&gt;

&lt;p&gt;This pattern looks concurrent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's not. It awaits each call one after another, sequentially. If &lt;code&gt;process()&lt;/code&gt; takes 100ms and you have 20 items, that's 2 seconds. The concurrent version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same 20 items, but the awaits now overlap instead of queuing. Assuming &lt;code&gt;process()&lt;/code&gt; is I/O-bound, total time is roughly the slowest single call, ~100ms, instead of 2 seconds.&lt;/p&gt;

&lt;p&gt;The ASYNC rule set flags the sequential pattern. The linter doesn't auto-fix this one since the fix changes execution behavior, not just formatting. It blocks the commit with an explanation, and you decide how to restructure.&lt;/p&gt;

&lt;p&gt;In a codebase with 70+ async API routes, this pattern showing up undetected in a few hot paths would quietly degrade response times without any obvious cause.&lt;/p&gt;
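&lt;p&gt;One caveat when converting these hot paths: a bare &lt;code&gt;gather&lt;/code&gt; launches everything at once, which can exhaust a connection pool. A sketch of a bounded variant (&lt;code&gt;process()&lt;/code&gt; here is a stand-in, not code from my services):&lt;/p&gt;

```python
import asyncio

# Hypothetical I/O-bound worker standing in for process() from the article
async def process(item):
    await asyncio.sleep(0.01)
    return item * 2

async def gather_bounded(items, limit=10):
    # Cap in-flight coroutines so a large batch can't exhaust a DB pool
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await process(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run(i) for i in items))

results = asyncio.run(gather_bounded(range(20), limit=5))
```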

&lt;h3&gt;Timezone Awareness&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;datetime.now()&lt;/code&gt; call without timezone info is a future debugging session. In a system that handles Dutch business hours, UTC timestamps from external APIs, and scheduled tasks that need to run at specific local times, a naive datetime silently produces the wrong result in any calculation that crosses a timezone boundary.&lt;/p&gt;

&lt;p&gt;The DTZ rules force &lt;code&gt;datetime.now(UTC)&lt;/code&gt; everywhere. It's the kind of rule that feels pedantic until you spend an afternoon figuring out why a scheduled task ran an hour early because of a daylight saving time transition.&lt;/p&gt;
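&lt;p&gt;For reference, the difference the DTZ rules enforce (shown with &lt;code&gt;timezone.utc&lt;/code&gt;, which also works on Python versions before the 3.11 &lt;code&gt;UTC&lt;/code&gt; alias):&lt;/p&gt;

```python
from datetime import datetime, timezone

naive = datetime.now()              # DTZ005 flags this: no tzinfo attached
aware = datetime.now(timezone.utc)  # passes: explicitly timezone-aware
```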

&lt;h2&gt;The Pre-Commit Hook&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/astral-sh/ruff-pre-commit&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v0.9.10&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ruff&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;--fix&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ruff-format&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every commit runs Ruff. The &lt;code&gt;--fix&lt;/code&gt; flag auto-resolves what it can: import sorting, formatting, simple simplifications. Anything it can't auto-fix blocks the commit until you handle it manually. The linter tells you exactly what's wrong and where.&lt;/p&gt;

&lt;h2&gt;What This Doesn't Catch&lt;/h2&gt;

&lt;p&gt;Linters understand structure, not intent. They won't tell you a function returns the wrong result. They won't tell you a query is technically valid SQL but returns stale data. They won't tell you an API endpoint works but exposes data it shouldn't.&lt;/p&gt;

&lt;p&gt;For that you need tests, thoughtful design, and the kind of review that requires understanding what the code is &lt;em&gt;supposed&lt;/em&gt; to do. The linter handles the mechanical layer: formatting, dead code, complexity, security anti-patterns, timezone correctness. That's maybe 80% of the issues that make a codebase worse over time. Automating that 80% means your limited review time goes to the 20% that actually needs a human.&lt;/p&gt;

&lt;h2&gt;Where This Is Going&lt;/h2&gt;

&lt;p&gt;Right now, the gap in the industry isn't AI code generation. That works well enough. The gap is the control layer around it. Experienced developers set up linters, pre-commit hooks, and CI gates because they've learned from years of debugging what happens when standards aren't enforced. They know why parameterized queries matter because they've seen an injection. They know why complexity limits matter because they've maintained a 300-line function.&lt;/p&gt;

&lt;p&gt;A newer generation of developers is building with powerful tools from day one, but without the scar tissue that makes you instinctively set up guardrails. The industry is going to need better defaults, better tooling, and better education around code quality enforcement. Not because new developers are worse. Because the tools are fast enough that the consequences of shipping without guardrails arrive faster too.&lt;/p&gt;

&lt;p&gt;Mechanical enforcement isn't a substitute for skill. It's what lets skill scale.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of "1 Dev, 22 Containers," a series about building an AI office management system on consumer hardware.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me on &lt;a href="https://github.com/The-Bash" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>codenewbie</category>
      <category>devops</category>
    </item>
    <item>
      <title>I Open-Sourced My Ollama Logging Proxy</title>
      <dc:creator>Becher Hilal</dc:creator>
      <pubDate>Thu, 09 Apr 2026 13:51:43 +0000</pubDate>
      <link>https://dev.to/bash-thedev/i-open-sourced-my-ollama-logging-proxy-i3p</link>
      <guid>https://dev.to/bash-thedev/i-open-sourced-my-ollama-logging-proxy-i3p</guid>
      <description>&lt;p&gt;When you use a cloud LLM API, you get usage data for free. Token counts, latency, cost per request, all tracked and queryable. When you run Ollama locally, you get a response and nothing else. No logs, no token counts, no way to tell which of your services is eating all the inference time.&lt;/p&gt;

&lt;p&gt;I had 5 services and a workflow engine all hitting the same Ollama instance on a Mac mini with 24GB of RAM. I couldn't tell which service was the heaviest consumer, whether my model-swapping strategy was working, or if specific workflows were making redundant calls. I was flying blind.&lt;/p&gt;

&lt;p&gt;So I built a transparent proxy that sits between my services and Ollama, logs every inference call with full token counts and timing, and streams responses through without adding any latency. Then I open-sourced it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/The-Bash/ollama-log-proxy" rel="noopener noreferrer"&gt;ollama-log-proxy on GitHub&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;The Concept&lt;/h2&gt;

&lt;p&gt;Point your apps at the proxy (port 11433) instead of Ollama directly (port 11434). The proxy forwards everything transparently. For inference calls, it records what model was used, how many tokens were consumed, how long it took, and which service made the request. Health checks and model listing get passed through without logging.&lt;/p&gt;
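&lt;p&gt;Adopting it is a one-line change per service. A sketch (model name and prompt are placeholders; the body follows Ollama's &lt;code&gt;/api/generate&lt;/code&gt; request format):&lt;/p&gt;

```python
import json

# Only the port changes when you adopt the proxy: 11434 -> 11433.
PROXY_URL = "http://localhost:11433/api/generate"   # through the logging proxy
DIRECT_URL = "http://localhost:11434/api/generate"  # straight to Ollama

payload = json.dumps({
    "model": "llama3",                # placeholder model name
    "prompt": "Classify this email.", # placeholder prompt
    "stream": False,
}).encode()

# A real call would look something like (not executed here):
# urllib.request.urlopen(urllib.request.Request(
#     PROXY_URL, data=payload, headers={"Content-Type": "application/json"}))
```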

&lt;p&gt;Response streaming works exactly as it does with Ollama directly. The proxy never buffers a full response before forwarding. Clients see tokens appearing in real time during long generations. The token counting happens from a copy of the stream after the response completes.&lt;/p&gt;

&lt;h2&gt;What the Data Actually Told Me&lt;/h2&gt;

&lt;p&gt;The reason I'm writing about this isn't the proxy itself. It's what I learned once I had visibility into my LLM traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model swap overhead is measurable.&lt;/strong&gt; I run two models on a machine with only enough memory for one at a time. They swap via a 10-second keep-alive timeout. The proxy showed me that the first request after a model swap is consistently 3-4x slower than subsequent requests. That's the cold-start penalty of loading a 9GB model into memory. Before the proxy, I assumed model swapping was "fast enough." Now I have actual numbers and can make informed decisions about keep-alive timing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One workflow was consuming 40% of all inference.&lt;/strong&gt; The email classification pipeline was the heaviest consumer by a wide margin. Seeing the actual numbers made it the obvious first target for optimization. I restructured the batching to reduce total calls, and the overall daily inference time dropped noticeably. Without per-caller breakdown, I would have been optimizing the wrong things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding calls are almost free.&lt;/strong&gt; My embedding model returns in under 100ms consistently. I had been conservative about batching embeddings because I assumed they were competing for resources with the larger models. They weren't. The proxy data gave me confidence to run embeddings more aggressively.&lt;/p&gt;

&lt;p&gt;None of this was visible before. I was making assumptions about my own infrastructure that turned out to be wrong.&lt;/p&gt;

&lt;h2&gt;What's In the Repo&lt;/h2&gt;

&lt;p&gt;The tool is designed to be useful with zero configuration (install, run, point your apps at it, logs go to SQLite) but flexible enough for production setups. It ships with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4 storage backends&lt;/strong&gt;: SQLite (default), PostgreSQL, JSONL files, or stdout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A built-in dashboard&lt;/strong&gt;: vanilla HTML and Chart.js, no npm, no build step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus metrics&lt;/strong&gt;: request counts, token totals, and duration histograms, ready for Grafana&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A pre-built Grafana dashboard&lt;/strong&gt;: included in the repo with a Docker Compose file that wires everything together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An extensible backend protocol&lt;/strong&gt;: implement three Python methods and you can send logs anywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The README has the full setup guide, CLI reference, and Docker instructions. I won't repeat all of that here.&lt;/p&gt;

&lt;h2&gt;Why I Open-Sourced It&lt;/h2&gt;

&lt;p&gt;This started as an internal tool. A single Python script that logged Ollama calls to a PostgreSQL table. It grew into something more general because the problem isn't specific to my setup. Anyone running Ollama for real workloads (not just chatting with a model in a terminal) eventually needs to know what's happening at the request level.&lt;/p&gt;

&lt;p&gt;The local LLM ecosystem has great tools for running models and great tools for building applications on top of them. The observability layer in between is mostly missing. This is one piece of that.&lt;/p&gt;

&lt;p&gt;If you run Ollama and you've ever wondered "which of my services is actually using the most tokens," give it a try. The README walks through everything from a zero-config SQLite setup to a full Docker + Grafana stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/The-Bash/ollama-log-proxy" rel="noopener noreferrer"&gt;github.com/The-Bash/ollama-log-proxy&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>aiops</category>
    </item>
    <item>
      <title>'I'll Add the Migration Later'... The Lies I Told Myself</title>
      <dc:creator>Becher Hilal</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:32:05 +0000</pubDate>
      <link>https://dev.to/bash-thedev/ill-add-the-migration-later-the-lies-i-told-myself-2ij6</link>
      <guid>https://dev.to/bash-thedev/ill-add-the-migration-later-the-lies-i-told-myself-2ij6</guid>
      <description>&lt;p&gt;I built a 127-table PostgreSQL database over three months. For the first week, every schema change was tracked properly. Numbered SQL files, a migration table with checksums, clean and reproducible. Then development velocity picked up, I started prioritizing feature delivery over process, and the tracking fell behind.&lt;/p&gt;

&lt;p&gt;By the time I audited the state, I had 19 tracked migrations from week 1 and over 100 untracked schema changes after that. No rollback capability. No record of what was applied when. No way to reproduce the database from scratch.&lt;/p&gt;

&lt;p&gt;This post is about what that cost me, how I recovered, and why migration discipline matters more when you're moving fast, not less.&lt;/p&gt;

&lt;h2&gt;How the Tracking Fell Behind&lt;/h2&gt;

&lt;p&gt;Week 1 was clean. Custom migration system, numbered SQL files, SHA-256 checksums:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;migrations/
  000_schema_migrations.sql
  001_init.sql
  002_init-finance.sql
  003_init-memory.sql
  ...
  018_add-gocardless-columns.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the pace changed. I was running multiple development sessions in parallel, each one producing features that needed schema changes. The quick path was deploying SQL directly to the running database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh docker-host &lt;span class="s2"&gt;"docker exec -i postgres psql -U user -d db"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s2"&gt;"
ALTER TABLE email_messages ADD COLUMN knowledge_extracted_at TIMESTAMPTZ;
CREATE INDEX idx_emails_knowledge ON email_messages (knowledge_extracted_at);
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL was always written properly. The change itself was correct. What I skipped was registering it in the migration system. Every time, the reasoning was the same: the feature is urgent, the schema change is simple, I'll add the migration file when things slow down.&lt;/p&gt;

&lt;p&gt;Things never slowed down. Over 11 weeks, the live database grew to 127 tables while the migration history still described 60.&lt;/p&gt;

&lt;p&gt;The situation got more complicated with parallel development. Two sessions both needed to add columns to the same table. Both changes deployed fine because they were different columns. But neither was tracked. A week later, a new session read the schema files from git, assumed those columns didn't exist, and wrote code that conflicted with the live state.&lt;/p&gt;

&lt;h2&gt;What It Cost Me&lt;/h2&gt;

&lt;p&gt;The most expensive incident: I deployed a schema change while a classification workflow was processing a batch of emails. The workflow had already classified a bunch of entries and marked them as "in progress," but hadn't written the results yet. When the schema changed, the final INSERT failed. Those emails were now stuck. The pipeline wouldn't reprocess them because they were flagged as in-progress, but the results never landed. Manual recovery took a couple of hours.&lt;/p&gt;

&lt;p&gt;The fix was simple once I found the stuck records. But the root cause wasn't the schema change itself. It was that I had no process for coordinating database changes with running workflows. No migration file meant no review step, no "is anything depending on this column right now" check before executing.&lt;/p&gt;

&lt;p&gt;The ongoing friction was worse than any single incident. New development sessions would reference schema files from git that were weeks out of date. Code would target columns that existed in production but not in the tracked schema, or vice versa. When something looked wrong in the data, there was no migration history to tell me what changed and when.&lt;/p&gt;

&lt;h2&gt;The Recovery: Baseline and Move Forward&lt;/h2&gt;

&lt;p&gt;I couldn't retroactively track 100+ changes. The pragmatic choice was to accept that weeks 2-12 were untracked, snapshot the current state, and build proper tracking from that point forward.&lt;/p&gt;

&lt;p&gt;I evaluated migration tools against my actual workflow. I write raw SQL with asyncpg, no ORM, so Alembic (built around SQLAlchemy models and their autogenerated diffs) wasn't a fit. Flyway needs a JVM runtime. I went with &lt;strong&gt;dbmate&lt;/strong&gt;: a single binary, plain SQL migration files with up/down sections, built-in state tracking, and rollback support. It matched how I already write database code without adding framework dependencies.&lt;/p&gt;

&lt;p&gt;The baseline process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Snapshot the live schema.&lt;/strong&gt; &lt;code&gt;pg_dump --schema-only&lt;/code&gt; exported the full database structure: 9,756 lines covering all 127 tables, indexes, constraints, and functions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Convert it to a dbmate migration.&lt;/strong&gt; Added dbmate's &lt;code&gt;-- migrate:up&lt;/code&gt; and &lt;code&gt;-- migrate:down&lt;/code&gt; markers so the tool recognizes the file format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Register it as already applied.&lt;/strong&gt; Created dbmate's tracking table and inserted the baseline version, so dbmate knows "this is the starting point, don't try to run it."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Preserve the old history.&lt;/strong&gt; Renamed the original migration table to &lt;code&gt;schema_migrations_legacy&lt;/code&gt; rather than dropping it, in case I ever need to reference the week 1 records.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After this, &lt;code&gt;dbmate status&lt;/code&gt; showed 1 applied, 0 pending. Clean slate. The 9,756-line baseline is the single source of truth for the database structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Process
&lt;/h2&gt;

&lt;p&gt;Every schema change now follows three rules:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The migration file gets created before the change is applied.&lt;/strong&gt; Not after. The act of writing the migration IS the review step. It forces me to think about what I'm changing, whether anything depends on it, and how to reverse it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every migration has a rollback section.&lt;/strong&gt; Even if the rollback is just "drop the column I added." This is what the &lt;code&gt;-- migrate:up&lt;/code&gt; and &lt;code&gt;-- migrate:down&lt;/code&gt; format gives you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- migrate:up&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;scheduler_tasks&lt;/span&gt;
    &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;scheduler_tasks&lt;/span&gt;
    &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- migrate:down&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;scheduler_tasks&lt;/span&gt;
    &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;scheduler_tasks&lt;/span&gt;
    &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;IF NOT EXISTS&lt;/code&gt; / &lt;code&gt;IF EXISTS&lt;/code&gt; pattern makes migrations idempotent. If something interrupts the process halfway, I can re-run safely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The migration commits with the feature code.&lt;/strong&gt; Same commit, same PR. &lt;code&gt;git log&lt;/code&gt; always links a schema change to the feature that needed it. No more guessing when a column was added or why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Tell Someone Starting a Similar Project
&lt;/h2&gt;

&lt;p&gt;Track your schema changes from day one. Even if it's just numbered SQL files in a directory. The tooling doesn't matter as much as the habit. A migration system you actually use beats a sophisticated one you skip when you're in a hurry.&lt;/p&gt;
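&lt;p&gt;"Numbered SQL files in a directory" really can be that small. A hedged sketch of the core of such a tracker, assuming filenames like &lt;code&gt;001_init.sql&lt;/code&gt; whose lexicographic order is the apply order (the function name is mine, for illustration):&lt;/p&gt;

```python
from pathlib import Path

def pending_migrations(migrations_dir: str, applied: set[str]) -> list[Path]:
    """Return migration files not yet applied, in version order.

    Assumes files are named like '001_create_users.sql': the version is
    the filename stem, and sorting filenames gives the apply order.
    """
    files = sorted(Path(migrations_dir).glob("*.sql"))
    return [f for f in files if f.stem not in applied]
```

&lt;p&gt;Everything else (running the file, recording the version in a tracking table) is a loop around this. The habit of writing the file is the part that matters.&lt;/p&gt;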

&lt;p&gt;If you're already behind, don't try to retroactively reconstruct history. Snapshot your current state with &lt;code&gt;pg_dump&lt;/code&gt;, declare it the baseline, and track forward. Accepting that some history is lost is better than pretending it's tracked when it isn't.&lt;/p&gt;

&lt;p&gt;If you use AI coding tools for development, add migration creation to your workflow instructions. These tools generate schema changes fast and won't create migration files unless you explicitly ask them to. The more automated your development, the more important it is that the tracking step is built into the process rather than treated as a separate manual step.&lt;/p&gt;

&lt;p&gt;And coordinate schema changes with your running services. A database migration that's technically correct can still cause problems if it modifies something that active workflows depend on. The migration file is the natural place to document those dependencies, even if it's just a comment at the top: "this column is used by the email classification pipeline, deploy the updated pipeline code first."&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 4 of "One Developer, 22 Containers." The series covers building an AI office management system on consumer hardware, the choices, the trade-offs, and the things that broke along the way.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me on &lt;a href="https://github.com/The-Bash" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Why I Switched from ChromaDB to ElasticSearch (and What I Miss)</title>
      <dc:creator>Becher Hilal</dc:creator>
      <pubDate>Mon, 06 Apr 2026 12:31:48 +0000</pubDate>
      <link>https://dev.to/bash-thedev/why-i-switched-from-chromadb-to-elasticsearch-and-what-i-miss-2kbl</link>
      <guid>https://dev.to/bash-thedev/why-i-switched-from-chromadb-to-elasticsearch-and-what-i-miss-2kbl</guid>
      <description>&lt;p&gt;My AI system had been extracting knowledge from emails for weeks. Thousands of facts, entities, patterns, all sitting in PostgreSQL. The problem was finding any of it. The brain was using hardcoded SQL filters like &lt;code&gt;WHERE category = 'infrastructure'&lt;/code&gt; to pull context before making decisions. If a fact about hosting costs was categorized under "billing," the brain would never see it when reasoning about infrastructure.&lt;/p&gt;

&lt;p&gt;I needed to search by meaning, not by label. But I also needed to search by exact match. An invoice number like "TDS-2026-003" has no semantic meaning. You can't find it with vector search. You need both approaches working together, in one query.&lt;/p&gt;

&lt;p&gt;That's what led me to Elasticsearch, and that's what this post is about.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Vector-Only Search
&lt;/h2&gt;

&lt;p&gt;ChromaDB had been in my Docker Compose file since the early days. The plan was a full RAG pipeline: embed documents, store vectors, retrieve relevant chunks. In practice, ChromaDB sat mostly idle while the knowledge base grew faster than expected. By the time I circled back to search, the requirements had changed.&lt;/p&gt;

&lt;p&gt;The knowledge base isn't just natural language. It contains invoice reference codes, domain names, specific dates, and contract numbers. All things that have no semantic meaning. When you vector-search for "TDS-2026-003," the embedding model encodes it into a 768-dimensional space and looks for nearby vectors. But an arbitrary reference code has no meaningful position in that space. The vector is vaguely related to "invoices" but won't reliably surface the one document containing that exact string.&lt;/p&gt;

&lt;p&gt;This isn't a ChromaDB problem. It's the nature of pure vector search. Any system that only does similarity matching will struggle with exact lookups.&lt;/p&gt;

&lt;p&gt;The requirement that forced the decision: the brain needs to find "TDS-2026-003" (exact keyword match) AND "hosting expenses" (semantic similarity) through the same search system. Ideally in the same query.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not pgvector
&lt;/h2&gt;

&lt;p&gt;I was already running PostgreSQL with 95+ tables. Adding pgvector would mean no extra container, no extra service to maintain. It was the obvious first choice.&lt;/p&gt;

&lt;p&gt;The problem is combining results. pgvector gives you vector similarity. PostgreSQL's built-in &lt;code&gt;tsvector&lt;/code&gt; gives you full-text search. But merging the two result sets into a single ranked list, where a document that scores well on both keyword match and semantic similarity ranks higher than one that only matches on one, requires building your own scoring logic. You're essentially building a search engine inside your database.&lt;/p&gt;

&lt;p&gt;Elasticsearch has a built-in algorithm for this called Reciprocal Rank Fusion (RRF). It takes the ranking from BM25 (keyword matching) and the ranking from kNN (vector similarity) and combines them mathematically. A document that appears near the top of both lists gets a higher combined score than one that only appears in one. No custom scoring logic, no manual result merging.&lt;/p&gt;

&lt;p&gt;If you only need vector similarity, or only keyword search, and not both fused into one principled ranking, pgvector is the better choice. Less infrastructure, less complexity, and it lives inside a database you're already running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration
&lt;/h2&gt;

&lt;p&gt;Setting up Elasticsearch itself was straightforward. Single node, no cluster, security disabled (behind Tailscale, no public access):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;elasticsearch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;discovery.type=single-node&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;xpack.security.enabled=false&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ES_JAVA_OPTS=-Xms4g -Xmx4g&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;xpack.ml.enabled=false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Disabling the ML features saved about 500MB of heap. I don't use Elasticsearch's built-in ML since all inference runs through Ollama.&lt;/p&gt;

&lt;p&gt;The index design followed a pattern I'd already established in PostgreSQL: separate data domains. Facts, entities, patterns, emails, transactions, and contracts each got their own index. Each domain has different fields and different query needs. Cramming everything into one giant index would mean sparse fields everywhere and confusing relevance scoring.&lt;/p&gt;

&lt;p&gt;Every index follows the same hybrid field pattern. The primary searchable text is stored twice: once as a &lt;code&gt;text&lt;/code&gt; field for BM25 keyword matching, and once as a &lt;code&gt;dense_vector&lt;/code&gt; field for kNN similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: the facts index
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyzer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fact_vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dense_vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dims&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The embedding model is nomic-embed-text, running on the Mac mini through Ollama: 768 dimensions and decent multilingual support, which matters for a business that receives emails in Dutch, English, and Arabic. Each text gets truncated to 500 characters before embedding, which keeps throughput consistent and captures most of the semantic signal.&lt;/p&gt;

&lt;p&gt;One thing I learned during indexing: enriching the text before embedding makes a big difference. "Hetzner" alone produces a generic vector. "Hetzner (company) cloud hosting hetzner.com" produces a much more useful one that captures what the entity actually is. Same approach for emails (subject + body snippet) and transactions (counterparty + description).&lt;/p&gt;
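&lt;p&gt;Both the enrichment and the 500-character cap are plain string work before the Ollama call. A minimal sketch of the idea; the function name and parameters are mine, not the actual pipeline code:&lt;/p&gt;

```python
EMBED_MAX_CHARS = 500  # cap keeps embedding throughput consistent

def enrich_entity(name: str, entity_type: str, context: str = "") -> str:
    """Build a richer string to embed than the bare entity name.

    "Hetzner" alone embeds to a generic vector; adding the type and a
    context hint anchors it: "Hetzner (company) cloud hosting hetzner.com".
    """
    parts = [name, f"({entity_type})"]
    if context:
        parts.append(context)
    return " ".join(parts)[:EMBED_MAX_CHARS]
```

&lt;p&gt;The same shape works for emails (subject plus a body snippet) and transactions (counterparty plus description).&lt;/p&gt;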

&lt;h2&gt;
  
  
  What Hybrid Search Looks Like
&lt;/h2&gt;

&lt;p&gt;The core function combines BM25 and kNN in a single Elasticsearch query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text_field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INDEX_FIELDS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;embed_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;es_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text_field&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vector_field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_candidates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rrf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank_constant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;excludes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vector_field&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;# ... execute and return hits
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;rank.rrf&lt;/code&gt; does the heavy lifting. For each document, it computes a score based on where it ranks in both the keyword results and the vector results. A document that ranks high in both lists gets a combined score higher than one that only appears in one. The &lt;code&gt;rank_constant&lt;/code&gt; of 20 controls how much weight goes to top-ranked results versus the rest.&lt;/p&gt;

&lt;p&gt;In practice, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Searching for &lt;strong&gt;"TDS-2026-003"&lt;/strong&gt;: BM25 finds the exact document. The kNN results are vaguely invoice-related noise. RRF correctly puts the exact match at #1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Searching for &lt;strong&gt;"hosting expenses"&lt;/strong&gt;: BM25 might find documents with those literal words. kNN finds "server rental charges," "VPS monthly payment," "cloud infrastructure billing." Conceptually identical, zero shared keywords. RRF combines both, giving you broader and more useful results than either approach alone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Searching for &lt;strong&gt;"Hetzner invoice"&lt;/strong&gt;: BM25 catches the exact name "Hetzner." kNN catches hosting-invoice-related concepts. Documents specifically about Hetzner invoices appear in both result sets and rank highest.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
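&lt;p&gt;The fusion itself is small enough to sketch. This is my own toy reimplementation of the RRF formula for illustration, not Elasticsearch internals:&lt;/p&gt;

```python
def rrf_fuse(bm25_ids: list[str], knn_ids: list[str], rank_constant: int = 20) -> list[str]:
    """Reciprocal Rank Fusion over two ranked lists of document IDs.

    Each list contributes 1 / (rank_constant + rank) per document, so a
    document near the top of both lists accumulates the highest score.
    """
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, knn_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

&lt;p&gt;A document that tops only one list (the exact-match case) still outranks lower entries from the other list, and a document that appears in both lists wins outright.&lt;/p&gt;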

&lt;h2&gt;
  
  
  Keeping It In Sync
&lt;/h2&gt;

&lt;p&gt;New knowledge gets extracted from emails continuously. Those new facts need to end up in Elasticsearch. Rather than syncing inline, which would add latency to the extraction pipeline, I added an &lt;code&gt;es_indexed&lt;/code&gt; boolean column to each table with a trigger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;fn_mark_es_unindexed&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;es_indexed&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Fires on INSERT or content UPDATE, not metadata changes&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;trg_facts_es_sync&lt;/span&gt;
    &lt;span class="k"&gt;BEFORE&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;memory_facts&lt;/span&gt;
    &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;fn_mark_es_unindexed&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A workflow polls every few minutes for rows where &lt;code&gt;es_indexed = FALSE&lt;/code&gt;, embeds them, indexes them to Elasticsearch, and flips the flag. Partial indexes on the boolean column make the "find unindexed rows" query fast since the index only contains rows that actually need syncing.&lt;/p&gt;
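&lt;p&gt;The polling workflow boils down to one loop body. A hedged sketch with the database and Elasticsearch calls injected as callables; the names (&lt;code&gt;fetch_unindexed&lt;/code&gt;, &lt;code&gt;mark_indexed&lt;/code&gt;, and so on) are illustrative, not my actual code:&lt;/p&gt;

```python
import asyncio
from typing import Awaitable, Callable

async def sync_batch(
    fetch_unindexed: Callable[[int], Awaitable[list[dict]]],
    embed: Callable[[str], Awaitable[list[float]]],
    index_doc: Callable[[dict], Awaitable[None]],
    mark_indexed: Callable[[int], Awaitable[None]],
    batch_size: int = 100,
) -> int:
    """Embed and index one batch of rows where es_indexed = FALSE.

    Returns the number of rows synced; the caller runs this on a timer.
    The flag is flipped only after a successful index call, so a crash
    mid-batch just leaves those rows for the next poll.
    """
    rows = await fetch_unindexed(batch_size)
    for row in rows:
        vector = await embed(row["fact"])
        await index_doc({**row, "fact_vector": vector})
        await mark_indexed(row["id"])
    return len(rows)
```

&lt;p&gt;Flipping the flag last is what makes the eventual consistency safe: the worst failure mode is re-embedding a row, never losing one.&lt;/p&gt;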

&lt;p&gt;It's not real-time, but for my use case a few minutes of delay between extraction and searchability is fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-Offs
&lt;/h2&gt;

&lt;p&gt;Search actually works now. The brain can find relevant context regardless of how facts were categorized. Exact matches and semantic matches in one query. Structured filtering by category, confidence score, or date range combined with free-text search.&lt;/p&gt;

&lt;p&gt;The cost: 4.5GB of RAM on a machine already running 20 containers. ChromaDB used 53MB. That's roughly 85x more memory. Elasticsearch is a JVM application and it shows. The heap allocation alone is 4GB, and you can't really go lower for production use.&lt;/p&gt;

&lt;p&gt;The query DSL is also significantly more verbose than ChromaDB's Python API. ChromaDB is &lt;code&gt;collection.query(query_texts=["hosting"], n_results=10)&lt;/code&gt;. The Elasticsearch equivalent is the nested JSON you saw above. I wrapped it in helper functions so the rest of the codebase doesn't have to deal with it, but the learning curve is real.&lt;/p&gt;

&lt;p&gt;Would I do it again? Yes, because hybrid search solved a real problem I couldn't solve any other way. But I wouldn't recommend Elasticsearch for every project that needs search. If you only need vector similarity, ChromaDB or pgvector are simpler and lighter. Elasticsearch earns its 4GB when you need keyword matching and semantic search working together in the same query.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 3 of "One Developer, 22 Containers." The series covers building an AI office management system on consumer hardware, the choices, the trade-offs, and the things that broke along the way.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me on &lt;a href="https://github.com/The-Bash" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>ai</category>
      <category>database</category>
      <category>search</category>
    </item>
    <item>
      <title>The Stack Nobody Recommended</title>
      <dc:creator>Becher Hilal</dc:creator>
      <pubDate>Sun, 05 Apr 2026 11:41:13 +0000</pubDate>
      <link>https://dev.to/bash-thedev/the-stack-nobody-recommended-3gja</link>
      <guid>https://dev.to/bash-thedev/the-stack-nobody-recommended-3gja</guid>
      <description>&lt;p&gt;The most common question I got after publishing &lt;a href="https://dev.to/bash-thedev/why-i-run-22-docker-services-at-home-23cj"&gt;Part 1&lt;/a&gt; was some variation of "why did you pick X instead of Y?" So this post is about that. Every major technology choice, what I actually considered, where I was right, and where I got lucky.&lt;/p&gt;

&lt;p&gt;I'll be upfront: some of these were informed decisions. Some were "I already know this tool, and I need to move fast." Both are valid, but they lead to different trade-offs down the line.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Backend: FastAPI
&lt;/h2&gt;

&lt;p&gt;I come from JavaScript and TypeScript. Years of React on the frontend, Express and Fastify on the backend. When I decided this project would be Python, because that's where the AI/ML ecosystem lives, I needed something that didn't feel foreign.&lt;/p&gt;

&lt;p&gt;FastAPI clicked immediately. The async/await model, the decorator-based routing, and type hints that actually do something. It felt like writing Fastify in Python. That familiarity wasn't the whole reason, but I'd be lying if I said it wasn't a factor.&lt;/p&gt;

&lt;p&gt;The technical reasons held up though. The system handles concurrent webhook callbacks from n8n, real-time polling from the React dashboard, and persistent asyncpg connections to PostgreSQL. All of that is async I/O, and FastAPI was built around that pattern. Django's async support exists now, but it still feels like it was added after the fact rather than designed in.&lt;/p&gt;

&lt;p&gt;I also deliberately avoided using an ORM. Every query in the system is hand-written SQL through asyncpg. With 95+ tables across 9 domains, I wanted to see exactly what was hitting the database. No magic, no N+1 surprises, no migration framework generating SQL I haven't read.&lt;/p&gt;

&lt;p&gt;The price I paid for skipping Django? No free admin panel; I built a React dashboard from scratch, which took weeks. No built-in migration system; I manage schema changes with raw SQL files piped through SSH into Docker, which has bitten me more than once (shell quoting across SSH → Docker → psql mangles complex statements). And a thinner plugin ecosystem when I need something Django has had for 20 years.&lt;/p&gt;

&lt;p&gt;If you're building a web app with user accounts, admin panels, and forms, just use Django. FastAPI makes sense when your backend is an API layer coordinating between services, which is my situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Database: PostgreSQL
&lt;/h2&gt;

&lt;p&gt;This wasn't a difficult decision. My data is deeply relational: transactions link to bank accounts, email classifications reference messages, knowledge facts get reinforced across multiple sources, and scheduler tasks reference agents that reference models. Trying to do this in MongoDB would mean denormalizing everything, embedding documents within documents, and handling consistency manually.&lt;/p&gt;

&lt;p&gt;But PostgreSQL gives me things beyond just relational storage that turned out to be critical.&lt;/p&gt;

&lt;p&gt;LISTEN/NOTIFY replaced what would normally require a message queue. When an email gets classified, a trigger fires a notification. The brain service catches it in milliseconds via asyncpg and reacts. No Kafka, no RabbitMQ; just a built-in feature that's been in Postgres for years:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;notify_email_classified&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;pg_notify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'email_classified'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json_build_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s1"&gt;'urgency'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;urgency&lt;/span&gt;
        &lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At my scale (maybe 50-100 events per hour), this is more than enough. Adding Kafka would mean another container, another config to maintain, and another thing that can go wrong at 3am. I'll add it when I actually need it.&lt;/p&gt;
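&lt;p&gt;The receiving side is only a few lines of asyncpg. The handler below is a hedged sketch: the callback signature is asyncpg's &lt;code&gt;add_listener&lt;/code&gt; convention, but the function name is mine:&lt;/p&gt;

```python
import json

def handle_email_classified(connection, pid, channel, payload):
    # payload is the json_build_object(...)::text emitted by the trigger
    event = json.loads(payload)
    return event["id"], event["category"], event["urgency"]

# Wiring it up requires a live connection:
#   conn = await asyncpg.connect(dsn)
#   await conn.add_listener("email_classified", handle_email_classified)
```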

&lt;p&gt;CHECK constraints turned out to be one of the best decisions in the whole project. The database enforces what categories the AI is allowed to output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'billing'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shipping'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'subscription'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'employment'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'legal'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'marketing'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'personal'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'automated'&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLMs ignore your instructions sometimes. The extractor once invented a category that wasn't in the allowed list, and the INSERT failed. That's exactly what should happen: a loud failure is infinitely better than silently polluting your data with invalid categories.&lt;/p&gt;
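&lt;p&gt;One cheap addition is an application-side mirror of the same list, so a bad category fails before the INSERT even fires while the constraint stays as the last line of defense. A sketch (the real allowed list is longer than the excerpt above):&lt;/p&gt;

```python
# App-side mirror of the DB CHECK constraint. The real list has more
# categories than shown here; this subset is from the article's excerpt.
ALLOWED_CATEGORIES = {
    "billing", "shipping", "subscription",
    "employment", "legal", "marketing",
    "personal", "automated",
}

def precheck_category(category: str) -> str:
    # Fail loudly before the INSERT; the CHECK constraint catches
    # anything that slips through anyway.
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"LLM produced a category outside the allowed list: {category!r}")
    return category
```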

&lt;p&gt;I also use window functions and interval queries for rate limiting, cooldowns, and circuit breakers. All things you'd normally reach for Redis to do. One fewer container in the stack.&lt;/p&gt;
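&lt;p&gt;A sketch of the Redis-free rate limit, again with illustrative table and column names, assuming an asyncpg connection:&lt;/p&gt;

```python
# Rate limiting as a plain interval query instead of Redis counters.
# Table and column names are illustrative.
RATE_LIMIT_SQL = """
    SELECT count(*) < $2 AS allowed
    FROM agent_actions
    WHERE agent = $1
      AND created_at > now() - interval '1 hour'
"""

async def under_rate_limit(conn, agent: str, per_hour: int) -> bool:
    # conn is an asyncpg connection; True while the agent is under budget
    return await conn.fetchval(RATE_LIMIT_SQL, agent, per_hour)
```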

&lt;p&gt;Where MongoDB would win: truly document-shaped data with variable schemas. CMS content, user profiles with heterogeneous fields, and event logs with different payloads. My data isn't any of those things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow Engine: n8n
&lt;/h2&gt;

&lt;p&gt;This is the decision I have the most complicated feelings about.&lt;/p&gt;

&lt;p&gt;n8n is a self-hosted visual workflow editor. You wire together triggers, HTTP requests, database queries, and code nodes. For my email pipelines, being able to see the flow as a diagram is genuinely valuable. When something breaks, I can see exactly which step failed and what data it had.&lt;/p&gt;

&lt;p&gt;The self-hosting angle ruled out Zapier and Make immediately. My workflows process email bodies and financial data. That doesn't go through a third party. And n8n's code nodes let me drop JavaScript directly into a workflow step, which is how I build the complex JSON payloads for Ollama calls.&lt;/p&gt;

&lt;p&gt;But n8n has caused more production incidents than any other component in the system. Scheduled workflows overlapped because n8n doesn't prevent concurrent executions by default, so I had to build a database-level guard that checks whether a previous run is still in progress. The API silently truncates long SQL queries without any error. Code nodes run in a sandboxed V8 isolate where &lt;code&gt;process.env&lt;/code&gt; doesn't exist (you need &lt;code&gt;$env&lt;/code&gt; instead), and building JSON in HTTP Request expressions is fragile enough that complex payloads should always go through a Code node first.&lt;/p&gt;

&lt;p&gt;None of these are dealbreakers individually. But collectively, n8n demands a level of defensive programming that I didn't expect from a workflow tool. Every workflow that involves an LLM call now has a stacking check, every SQL query gets verified after deployment, and I've learned to build payloads in Code nodes instead of expression fields.&lt;/p&gt;
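&lt;p&gt;The stacking check can be as small as a Postgres advisory lock. This is a sketch of the idea, not my exact guard, and the workflow ids are made up:&lt;/p&gt;

```python
# Database-level guard against overlapping scheduled runs, sketched
# with a Postgres advisory lock. Workflow names and ids are illustrative.
WORKFLOW_LOCK_IDS = {"email-classifier": 1001, "knowledge-extractor": 1002}

async def run_exclusively(conn, workflow: str, job):
    lock_id = WORKFLOW_LOCK_IDS[workflow]
    # pg_try_advisory_lock returns false immediately when another
    # session holds the lock, i.e. a previous run is still going.
    if not await conn.fetchval("SELECT pg_try_advisory_lock($1)", lock_id):
        return None  # skip this run instead of stacking
    try:
        return await job()
    finally:
        await conn.fetchval("SELECT pg_advisory_unlock($1)", lock_id)
```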

&lt;p&gt;If your workflows are mostly code with minimal visual benefit, write Python scripts with a scheduler. The visual editor is n8n's actual advantage. If you don't need it, you're adding complexity for nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local LLM Serving: Ollama
&lt;/h2&gt;

&lt;p&gt;Ollama won on simplicity and nothing else. Install it, &lt;code&gt;ollama pull qwen3:14b&lt;/code&gt;, and there's a model serving API on &lt;code&gt;localhost:11434&lt;/code&gt;. No CUDA configuration, no Python environment management, no Docker GPU passthrough headaches.&lt;/p&gt;

&lt;p&gt;Switching between models is changing one string in the request payload. The API is consistent across every model (&lt;code&gt;/api/chat&lt;/code&gt;, &lt;code&gt;/api/generate&lt;/code&gt;, &lt;code&gt;/api/embed&lt;/code&gt;), which makes the routing logic in my system trivial.&lt;/p&gt;
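&lt;p&gt;The routing really is that thin. A hedged sketch: the endpoint shape is Ollama's documented API, but the tier table and function names are mine:&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"
MODEL_FOR_TIER = {"classification": "qwen2.5:14b", "reasoning": "qwen3:14b"}

def build_chat_payload(tier: str, prompt: str) -> dict:
    # Switching models is changing this one string.
    return {
        "model": MODEL_FOR_TIER[tier],
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(tier: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(build_chat_payload(tier, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```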

&lt;p&gt;What I gave up: vLLM offers tensor parallelism, continuous batching, and quantization control that Ollama hides behind its abstraction. For a platform serving many concurrent users, vLLM is the right choice. For a single-user system running one model at a time on a Mac mini, Ollama's defaults are fine, and the setup time difference is measured in hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Communication: Mattermost (For Now)
&lt;/h2&gt;

&lt;p&gt;I need human-in-the-loop approval for every consequential action. The system posts to a chat with context and Approve/Reject buttons. I click, a webhook fires, and the workflow continues.&lt;/p&gt;

&lt;p&gt;I picked Mattermost because it's open source, self-hosted, and has interactive message attachments. That was the full evaluation. It wasn't strategic. It was "this runs in Docker and has buttons."&lt;/p&gt;

&lt;p&gt;It works. But I'm planning to migrate to Rocket.Chat. I want voice interaction with the assistant eventually, and Mattermost's audio calling is limited. Rocket.Chat also has more mature mobile apps, which matters because the whole point of HITL is approving actions when I'm not at my desk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Networking: Tailscale
&lt;/h2&gt;

&lt;p&gt;Three machines need to talk to each other. Tailscale gives each one a stable IP that works regardless of the physical network. No port forwarding, no dynamic DNS, no opening ports on my router. Setup took about 10 minutes.&lt;/p&gt;

&lt;p&gt;I could have configured WireGuard manually for the same encryption and performance, but then I'd be managing key rotation, endpoint configs, and NAT traversal myself. For a three-node network, Tailscale's convenience is worth it.&lt;/p&gt;

&lt;p&gt;One thing people ask: why not Cloudflare Tunnels? Because they solve a different problem. Cloudflare Tunnels expose services to the internet through Cloudflare's network. My services don't need to be on the internet; they need to talk to each other privately. Mesh VPN, not reverse proxy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Search: Elasticsearch (Added Later)
&lt;/h2&gt;

&lt;p&gt;I didn't start with Elasticsearch. I started with ChromaDB because it's lighter, runs in Docker, has a simple Python API, and is good enough for basic vector search.&lt;/p&gt;

&lt;p&gt;The problem showed up when the knowledge base grew. I had thousands of facts, entities, and patterns, and I needed to search by meaning &lt;em&gt;and&lt;/em&gt; by exact keywords in the same query. ChromaDB handles vectors. PostgreSQL handles keywords. But running two searches across two systems and merging results is fragile and slow.&lt;/p&gt;

&lt;p&gt;Elasticsearch does both natively (BM25 for exact keyword matching, kNN for vector similarity) in a single query. That's what made me migrate. The trade-off is 4GB of heap memory on a machine that was already tight. For smaller datasets or pure vector search, ChromaDB or pgvector are lighter options.&lt;/p&gt;
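&lt;p&gt;In Elasticsearch 8.x, the two legs live in one request body. A sketch with illustrative index field names:&lt;/p&gt;

```python
def hybrid_query(text: str, vector: list[float], k: int = 10) -> dict:
    # One request: a BM25 keyword leg plus a kNN vector leg.
    # Field names ("content", "embedding") are illustrative.
    return {
        "query": {"match": {"content": text}},
        "knn": {
            "field": "embedding",
            "query_vector": vector,
            "k": k,
            "num_candidates": 100,
        },
        "size": k,
    }
```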

&lt;p&gt;I'll cover the migration in a dedicated post.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;Deployment. Right now, I deploy by SSH-ing into a Windows host and running PowerShell commands. No CI/CD, no GitHub Actions. It works because I'm the only developer, but it's the first thing that would break if anyone else needed to contribute.&lt;/p&gt;

&lt;p&gt;If I started over, Linux from day one and a basic GitHub Actions pipeline; push to main, build container, deploy. Not Kubernetes, not Terraform. Just automating the 90-second script I currently run manually.&lt;/p&gt;
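&lt;p&gt;That pipeline fits in one file. A hypothetical minimal version, where the script path and image name are placeholders:&lt;/p&gt;

```yaml
# .github/workflows/deploy.yml (hypothetical minimal pipeline)
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myapp:latest .
      - run: ./scripts/deploy.sh   # the same 90-second script, automated
```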




&lt;p&gt;&lt;em&gt;This is Part 2 of "&lt;a href="https://dev.to/bash-thedev/series/38113"&gt;One Developer, 22 Containers&lt;/a&gt;". Next up: migrating from ChromaDB to Elasticsearch, and why hybrid search changed how my AI system finds information.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you've made similar choices — or different ones — I'd love to hear about it in the comments. Find me on &lt;a href="https://github.com/The-Bash" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>fastapi</category>
      <category>postgres</category>
      <category>docker</category>
    </item>
    <item>
      <title>Why I Run 22 Docker Services at Home</title>
      <dc:creator>Becher Hilal</dc:creator>
      <pubDate>Sat, 04 Apr 2026 21:39:46 +0000</pubDate>
      <link>https://dev.to/bash-thedev/why-i-run-22-docker-services-at-home-23cj</link>
      <guid>https://dev.to/bash-thedev/why-i-run-22-docker-services-at-home-23cj</guid>
      <description>&lt;p&gt;Somewhere in my living room, a 2018 gaming PC is running 22 Docker containers, processing 15,000 emails through a local LLM, and managing the finances of a real business. It was never supposed to do any of this.&lt;/p&gt;

&lt;p&gt;I run a one-person software consultancy in the Netherlands; web development, 3D printing, and consulting. Last year, I started building an AI system to help me manage it all. Eight specialized agents handling email triage, financial tracking, infrastructure monitoring, and scheduling. Every piece of inference runs locally. No cloud APIs touching my private data.&lt;/p&gt;

&lt;p&gt;This post covers the hardware, what it actually costs, and what I'd do differently if I started over.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Three Machines, One Mesh Network
&lt;/h2&gt;

&lt;p&gt;The entire system runs on three machines connected via &lt;a href="https://tailscale.com/" rel="noopener noreferrer"&gt;Tailscale&lt;/a&gt; mesh VPN:&lt;/p&gt;

&lt;h4&gt;
  
  
  docker-host
&lt;/h4&gt;

&lt;p&gt;A PC I assembled from leftover parts. Over the years, as I upgraded my main gaming machine, the old CPUs, RAM sticks, and motherboards piled up. Eventually, I had enough to build a second computer. It now runs 22+ Docker containers 24/7.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: AMD Ryzen 5 2600X (6 cores, 12 threads)&lt;/li&gt;
&lt;li&gt;RAM: 32GB DDR4 (two 16GB kits — more on this later)&lt;/li&gt;
&lt;li&gt;GPU: NVIDIA GTX 1060 3GB — useless for inference (3GB VRAM), but the Ryzen 5 2600X has no integrated graphics. Without this card, there's no display output. It exists purely to give the machine a screen.&lt;/li&gt;
&lt;li&gt;OS: Windows 11 with Docker Desktop — I still use this machine as a Windows PC occasionally, which is the honest reason it hasn't been wiped to Linux yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  inference
&lt;/h4&gt;

&lt;p&gt;A Mac mini M4, bought specifically for local LLM inference.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chip: Apple M4, 10-core CPU, 10-core GPU&lt;/li&gt;
&lt;li&gt;RAM: 24GB unified memory (~17GB available for models after OS and services)&lt;/li&gt;
&lt;li&gt;Role: Ollama model serving, plus Proton Mail Bridge (which requires a GUI; no headless mode exists)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  edge-vps
&lt;/h4&gt;

&lt;p&gt;A Hostinger VPS, ~€5/month. Runs Nginx Proxy Manager and Uptime Kuma. Exists for one reason: if my home network dies, this is the canary that tells me about it. You can't monitor your own availability from inside the thing that's failing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Local-First: It Started With the Subscriptions
&lt;/h2&gt;

&lt;p&gt;Before I built any of this, I was paying for Claude Pro, GPT Pro, Perplexity Pro, and Google AI. Four separate subscriptions. Each gave me partial access to models through their own interfaces, each with its own limitations on what I could integrate, and each getting a copy of whatever I fed into it.&lt;/p&gt;

&lt;p&gt;My system handles emails, bank transactions, client contracts, delivery tracking, and tax preparation, basically the complete operational picture of my business, in one database. That's the kind of data I don't want leaving my network.&lt;/p&gt;

&lt;p&gt;It's not that I think cloud providers are malicious. It's that I don't want to be in a position where I have to &lt;em&gt;trust&lt;/em&gt; their data handling with everything my business runs on. So the guardrails are explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cloud_llm_boundary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hard_rule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO cloud LLM usage by any agent without explicit human permission."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"prohibited_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Email content — body, subject, sender, recipient, attachments"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Financial data — transactions, invoices, account numbers, balances"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Client information — names, contacts, project details, contracts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Personal data — addresses, phone numbers, government identifiers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"Infrastructure — credentials, API keys, internal hostnames, IPs"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"exceptions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Development and debugging only, never with production data."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every piece of production inference runs on Ollama on the Mac mini. Zero tokens leave the house for private data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Math
&lt;/h2&gt;

&lt;p&gt;This is the part that convinced me the approach was sustainable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Local Cost&lt;/th&gt;
&lt;th&gt;Cloud Equivalent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM inference&lt;/td&gt;
&lt;td&gt;€0 (electricity)&lt;/td&gt;
&lt;td&gt;€100-500/mo (API usage at similar volume)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;€0 (Docker)&lt;/td&gt;
&lt;td&gt;€50-200/mo (managed Postgres)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;€0 (Docker)&lt;/td&gt;
&lt;td&gt;€100-300/mo (Elastic Cloud)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;n8n&lt;/td&gt;
&lt;td&gt;€0 (self-hosted)&lt;/td&gt;
&lt;td&gt;€24-200/mo (n8n cloud)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mattermost&lt;/td&gt;
&lt;td&gt;€0 (self-hosted)&lt;/td&gt;
&lt;td&gt;€0-50/mo (limited free tier)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;€5/mo (VPS)&lt;/td&gt;
&lt;td&gt;€20-50/mo (Datadog, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~€5/mo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;€300-1,300/mo&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But electricity is real. The Ryzen 5 2600X system idles around 65W (more under load, and the GTX 1060 adds idle draw on top), the Mac mini M4 around 5-7W, rising to ~30W during inference. Call it 140W average for the whole setup. At Dutch electricity prices (~€0.25/kWh), that's about &lt;strong&gt;€25-30/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real total: ~€35/month&lt;/strong&gt; versus a minimum of €300/month in cloud services. And I went from four AI subscriptions down to one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually Running: 22 Containers
&lt;/h2&gt;

&lt;p&gt;The VPS runs separately with just 2 containers: Nginx Proxy Manager for webhook ingress and Uptime Kuma for external monitoring. Everything else is on docker-host.&lt;/p&gt;

&lt;h3&gt;
  
  
  The RAM Reality
&lt;/h3&gt;

&lt;p&gt;Here's the part nobody shows you in tutorials: how 32GB actually gets divided:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Windows 11 OS overhead:              ~4 GB
Elasticsearch (Java heap):            4 GB  (-Xms4g -Xmx4g)
n8n (Node.js):                      ~4-6 GB typical usage
PostgreSQL:                          ~1 GB
Mattermost:                         ~0.5 GB
7x Python services:                  ~2 GB total
Other containers:                    ~1 GB
Docker engine overhead:              ~1 GB
─────────────────────────────────────────
Total:                              ~18-20 GB typical, ~30 GB under load
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The n8n allocation deserves explanation. It's configured with &lt;code&gt;NODE_OPTIONS=--max-old-space-size=16384&lt;/code&gt;, a 16GB ceiling. That sounds aggressive, but without it, Node.js defaults to a much lower heap limit. When a workflow processes a batch of large email bodies through an LLM and the responses come back as multi-kilobyte JSON objects, memory spikes fast. If the heap limit is too low, Node's garbage collector starts running constantly, trying to free memory instead of doing actual work. Eventually, the process crashes with an out-of-memory error. The high ceiling gives it room to breathe. In practice, n8n uses 4-6GB.&lt;/p&gt;

&lt;p&gt;The real constraint isn't peak usage; it's that everything competes for the same memory bus. When Elasticsearch is indexing, n8n is running 16 workflows, and PostgreSQL is handling a complex CTE query simultaneously... things slow down. Nothing crashes, it just slows down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ollama on the Mac Mini: The Inference Layer
&lt;/h2&gt;

&lt;p&gt;The M4's unified memory architecture is genuinely excellent for LLM inference. Unlike discrete GPUs, where you're limited by VRAM (my GTX 1060's 3GB is useless for anything beyond tiny models), the M4 can use its full 24GB for model weights. The memory bandwidth (120 GB/s) is lower than a high-end GPU, but for a 14B parameter model, it's more than enough.&lt;/p&gt;

&lt;p&gt;I run a tiered model strategy:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;qwen2.5:14b&lt;/td&gt;
&lt;td&gt;~9 GB&lt;/td&gt;
&lt;td&gt;Email triage, transaction categorization&lt;/td&gt;
&lt;td&gt;1-2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;qwen3:14b&lt;/td&gt;
&lt;td&gt;~9.3 GB&lt;/td&gt;
&lt;td&gt;Judgment calls, tool use, knowledge extraction&lt;/td&gt;
&lt;td&gt;1.5-3s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two more tiers are planned but not yet running locally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generation (qwen3:32b)&lt;/strong&gt; — for client-facing content where quality matters. Needs a GPU with more VRAM than what I currently have.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision (llama3.2-vision:11b)&lt;/strong&gt; — for screenshot comparison and 3D print quality inspection. Planned for when the system matures enough to need it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With ~17GB available for models, I can only run one at a time. The keep-alive is set to 10 seconds; models unload quickly to free RAM for the next one. The flow looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Classification batch starts → qwen2.5:14b loads (~4 second cold start)&lt;/li&gt;
&lt;li&gt;Processes 10-50 emails → model stays warm&lt;/li&gt;
&lt;li&gt;Batch finishes → 10 seconds idle → model unloads&lt;/li&gt;
&lt;li&gt;Brain needs to reason → qwen3:14b loads&lt;/li&gt;
&lt;li&gt;Brain finishes → unloads&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This works because classification and reasoning don't overlap much. The classifier runs on a schedule; the brain runs on events. The 4-second cold start is acceptable. If I had 48GB of unified memory, I'd keep both warm permanently, but the M4 with 24GB was the sweet spot for price/performance.&lt;/p&gt;
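&lt;p&gt;The unload behavior is just a request option. A sketch, assuming Ollama's documented &lt;code&gt;keep_alive&lt;/code&gt; parameter:&lt;/p&gt;

```python
def build_generate_payload(model: str, prompt: str) -> dict:
    # keep_alive controls how long Ollama keeps the model loaded after
    # the last request; 10s frees the ~17GB quickly for the other tier.
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": "10s",
    }
```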

&lt;h3&gt;
  
  
  The Logging Proxy
&lt;/h3&gt;

&lt;p&gt;One of the more useful things I built is an HTTP proxy that sits between all consumers and Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Proxy sits between all services and Ollama
# Every inference call gets logged to PostgreSQL
&lt;/span&gt;&lt;span class="n"&gt;INFERENCE_ENDPOINTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/embed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;POLL_ENDPOINTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/ps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every inference request gets logged with full token counts, latency, and caller info. The logging happens in a daemon thread, so it doesn't block the response. This means I can query the usage table to see exactly which service is consuming the most tokens, what the average latency is, and which workflows are the heaviest users.&lt;/p&gt;

&lt;p&gt;All containers talk to the proxy. They never hit Ollama directly. This gives me a single point of observability for all LLM traffic across the system.&lt;/p&gt;
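&lt;p&gt;The non-blocking logging is a plain producer/consumer pattern. A simplified sketch: function and field names are illustrative, and the real writer does an INSERT into the usage table:&lt;/p&gt;

```python
import queue
import threading

log_queue: "queue.Queue[dict]" = queue.Queue()

def start_writer(handler):
    # handler receives each usage record; in production it writes a
    # row to PostgreSQL. Daemon thread, so it never blocks shutdown.
    def _writer():
        while True:
            handler(log_queue.get())
    t = threading.Thread(target=_writer, daemon=True)
    t.start()
    return t

def log_inference(caller: str, model: str, tokens: int, latency_ms: float):
    # Called after the response has already been proxied back;
    # put() is cheap, so the response path never waits on the database.
    log_queue.put({"caller": caller, "model": model,
                   "tokens": tokens, "latency_ms": latency_ms})
```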

&lt;h2&gt;
  
  
  How the Machines Find Each Other
&lt;/h2&gt;

&lt;p&gt;Tailscale gives each machine a stable IP that works regardless of the physical network. No port forwarding. No dynamic DNS. No opening ports on the home router.&lt;/p&gt;

&lt;p&gt;Docker containers on the docker-host reach the inference server's Ollama through the Tailscale IP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml (simplified, IPs redacted)&lt;/span&gt;
&lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;OLLAMA_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://&amp;lt;inference-tailscale-ip&amp;gt;:11433&lt;/span&gt;
      &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql://user:pass@postgres:5432/db&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Services on the same Docker host use Docker service names (e.g., &lt;code&gt;http://postgres:5432&lt;/code&gt;). Cross-machine communication goes through Tailscale IPs.&lt;/p&gt;

&lt;p&gt;I also run CoreDNS inside Docker for internal subdomain routing, friendly names like &lt;code&gt;dashboard.internal&lt;/code&gt;, &lt;code&gt;api.internal&lt;/code&gt;, all resolving to Tailscale IPs within the mesh only. One thing worth knowing if you set this up: CoreDNS in authoritative mode doesn't fall through to external DNS for missing records; it returns NXDOMAIN. So every new internal subdomain needs to be added to the zone file, or it simply won't resolve.&lt;/p&gt;
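&lt;p&gt;For illustration, the zone ends up looking something like this (the names and Tailscale IP are placeholders, and a real zone file also needs SOA/NS records):&lt;/p&gt;

```text
; internal zone (illustrative fragment)
dashboard.internal.   IN  A  100.64.0.10
api.internal.         IN  A  100.64.0.10
; anything not listed here gets NXDOMAIN, not a fallthrough to external DNS
```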

&lt;h2&gt;
  
  
  The Memory Mystery
&lt;/h2&gt;

&lt;p&gt;The 32GB of DDR4 in docker-host is two 16GB kits of Corsair Vengeance RGB Pro, rated at 3200MHz. Same model number, same batch number; one kit bought in 2018, one in 2022. They should be as compatible as two kits can physically be.&lt;/p&gt;

&lt;p&gt;They aren't. I've set XMP to 3200MHz multiple times. With the original single kit, I even ran a stable overclock at 3600MHz. But since adding the second kit, the profile either fails to apply or reverts to the JEDEC default of 2133MHz after some time. No error, no BSOD; it just silently drops back.&lt;/p&gt;

&lt;p&gt;So right now, 32GB of 3200MHz-rated memory is running at 2133MHz. That's roughly 33% of the memory bandwidth sitting unused. Every container, every query, every Docker layer pull. All running at two-thirds speed on the memory bus.&lt;/p&gt;

&lt;p&gt;I haven't fully diagnosed whether it's a subtle timing incompatibility between the kits, a motherboard limitation with four DIMMs populated, or something else entirely. It's on the list, but it's the kind of issue that requires dedicated downtime to troubleshoot properly, and downtime means taking 22 containers offline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Change If I Started Over
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Linux instead of Windows on docker-host.&lt;/strong&gt; Docker on Windows works, but it adds friction everywhere. My deploy script runs PowerShell commands over SSH (&lt;code&gt;Remove-Item -Recurse -Force&lt;/code&gt; instead of &lt;code&gt;rm -rf&lt;/code&gt;). I once corrupted a CoreDNS zone file because PowerShell's &lt;code&gt;-replace&lt;/code&gt; treats &lt;code&gt;\n&lt;/code&gt; as literal text instead of a newline. Linux would eliminate an entire category of issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A dedicated, purpose-built server.&lt;/strong&gt; The current machine has three problems: it's not built for this job, it's not efficient at this job, and it has competing use cases.&lt;/p&gt;

&lt;p&gt;The docker-host is also my occasional Windows machine (I still use it for things that need Windows). That means I can't wipe it to Linux, and it means the machine is pulling double duty when it should be dedicated infrastructure. In an ideal setup, Docker lives on its own box that I never touch except to SSH into.&lt;/p&gt;

&lt;p&gt;The hardware itself is wasteful for a container host. The Ryzen 5 2600X pulls 95W TDP. Those 12 threads are genuinely useful when n8n, PostgreSQL, and Elasticsearch all spike at once, but most of the time, containers are waiting on I/O, not burning CPU. An Intel i5-12500T at 35W would handle the same workload. Then there's the GTX 1060 drawing 120W under load for absolutely nothing; it's only installed because the Ryzen has no integrated graphics. And the 650W PSU is running at maybe 20% load, which is the least efficient part of its power curve. The whole machine is basically optimized for gaming, not for sitting in a corner running Docker.&lt;/p&gt;

&lt;p&gt;My ideal replacement: something like a &lt;strong&gt;Dell OptiPlex 3080 Micro&lt;/strong&gt; — small form factor, Intel with integrated graphics (no discrete GPU needed), 16GB RAM (expandable), designed for 24/7 operation, near-silent. These go for reasonable prices secondhand, though RAM pricing makes anything above 16GB expensive. It wouldn't match the Ryzen's raw multi-threaded output, but for a Docker host that's mostly waiting on I/O and network, it doesn't need to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;48GB on the Mac mini.&lt;/strong&gt; The 24GB M4 is good, but being limited to one model at a time creates a scheduling bottleneck. With 48GB I could keep the classifier and the reasoning model warm simultaneously and cut out the cold-start latency entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with Elasticsearch earlier.&lt;/strong&gt; I started with ChromaDB for vector search because it's lighter. But once I needed hybrid search (keyword + semantic in the same query), I had to migrate anyway. If your data has both structured metadata and unstructured text (and you know you'll need to search both), start with something that handles both natively. That said, if you only need vector similarity for a smaller dataset, ChromaDB or &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector&lt;/a&gt; will save you 2GB of RAM and a lot of query DSL.&lt;/p&gt;

&lt;h2&gt;The Control Argument&lt;/h2&gt;

&lt;p&gt;Beyond cost and privacy, there's a third reason I run local-first: &lt;strong&gt;I own the upgrade timeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I decide when to update Postgres. When Elasticsearch changes licensing, it doesn't affect my running instance. When n8n raises cloud pricing, it doesn't matter. When a model provider deprecates an API version, my workflows keep running.&lt;/p&gt;

&lt;p&gt;I've been bitten by the alternative. I originally planned to use a specific open banking provider for transaction imports. They closed to new signups months after I started planning around them. Because my architecture is local-first, switching to a different provider was a contained change: one API integration, not a full re-architecture.&lt;/p&gt;
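&lt;p&gt;The thing that made the swap contained was having a seam between "fetch transactions" and everything downstream. A minimal sketch of that pattern (all names here are hypothetical, not my actual code):&lt;/p&gt;

```python
from typing import Protocol

class TransactionProvider(Protocol):
    """Hypothetical seam: downstream code only ever sees this interface."""
    def fetch_transactions(self, account_id: str) -> list[dict]: ...

class ProviderA:
    """One concrete provider; the API call is stubbed for illustration."""
    def fetch_transactions(self, account_id: str) -> list[dict]:
        # real version would call the provider's HTTP API here
        return [{"account": account_id, "amount": -12.50}]

def import_transactions(provider: TransactionProvider, account_id: str) -> int:
    """Normalize and store transactions; returns the count imported."""
    txns = provider.fetch_transactions(account_id)
    # ...normalization and storage would happen here...
    return len(txns)

# Swapping providers means writing one new class, not touching the pipeline.
print(import_transactions(ProviderA(), "acct-1"))
```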

&lt;h2&gt;Is This For You?&lt;/h2&gt;

&lt;p&gt;Honest answer: probably not, if you're building a side project or a startup MVP. The setup cost in time is real. Docker Compose files don't write themselves, Tailscale needs configuring, and you'll spend a weekend debugging why a Python service can't reach Elasticsearch through Docker's bridge network.&lt;/p&gt;
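&lt;p&gt;For what it's worth, the fix for that particular weekend is almost always the same: put both services on a user-defined network and connect by service name, not localhost. A minimal Compose sketch (service and network names are illustrative):&lt;/p&gt;

```yaml
# Both services join the same user-defined network, so Compose's DNS
# resolves "elasticsearch" as a hostname from inside the api container.
services:
  api:
    build: ./api
    networks: [backend]
    environment:
      ES_URL: http://elasticsearch:9200   # service name, never localhost
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    networks: [backend]
    environment:
      discovery.type: single-node
networks:
  backend: {}
```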

&lt;p&gt;If your data is genuinely sensitive, you have ongoing infrastructure needs, and you don't mind being your own sysadmin, it's worth considering. If you need to scale past what consumer hardware handles, or you have a team that needs managed infrastructure, or you'd rather write application code than debug Docker networking at midnight, stick with cloud services. There's no shame in that: it's a legitimate trade-off.&lt;/p&gt;

&lt;p&gt;For me, €35/month, zero data leaving the house, and full control over every component is worth being my own sysadmin, DBA, and on-call engineer. For a solo operation, that math works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 1 of "&lt;a href="https://dev.to/bash-thedev/series/38113"&gt;One Developer, 22 Containers&lt;/a&gt;" — a series about building an AI office management system on consumer hardware. Next up: the technology decisions behind every major component, what I considered, and what I'd pick differently today.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you're building something similar or have questions about any of the stack, I'd love to hear about it in the comments. You can also find me on &lt;a href="https://github.com/The-Bash" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>selfhosted</category>
      <category>ai</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
