DEV Community: 우병수

Modularized Workflow Toggles in GitHub Actions: Cutting CI/CD Run Time Without Losing Control

우병수 — Mon, 13 Jul 2026 08:10:39 +0000

TL;DR: The most expensive CI failure isn't a flaky test. It's running the full pipeline on a commit that changed two lines in README.md.

📖 Reading time: ~21 min

What's in this article

The Problem: Every Push Runs Everything, and It's Killing Your Feedback Loop
The Three Toggle Mechanisms Worth Understanding
Building the Toggle Layer: A Real Orchestrator Workflow
Reusable Workflows as the Toggle Target: Structure and Secrets Passing
Matrix Toggles: Scoping Environments and Platforms Dynamically
Self-Hosted Runner Considerations When Using Toggles
When This Pattern Breaks Down (and What to Use Instead)

The Problem: Every Push Runs Everything, and It's Killing Your Feedback Loop

The most expensive CI failure isn't a flaky test — it's running the full pipeline on a commit that changed two lines in README.md. A typical monolithic .github/workflows/ci.yml queues lint, unit tests, integration tests, Docker build, push to registry, deploy to staging, and a Slack notification for every single push to every branch. That setup makes sense for exactly one scenario: your first week before you have enough complexity to care. After that, it's billing you for work that produces zero signal.

The operator cost is concrete. GitHub-hosted runners bill per minute, and a full pipeline that takes 18 minutes on a docs-only commit is 18 minutes of pure waste — multiplied by every engineer pushing WIP commits. On a self-hosted runner, the billing is electricity and thermal load on hardware you own. My local runner sits on the same workstation running Ollama inference; a needless Docker multi-stage build competes with whatever model is warming up. Idle CPU cycles aren't free when the machine has a day job. The waste isn't abstract — it shows up in slower feedback, higher costs, and runners that are busy when you actually need them.

The fix isn't splitting into a dozen separate workflow files and maintaining them all independently. That trades one problem for another — diverging logic, copy-pasted steps, and no single place to audit what actually runs. The better model is workflow toggles: a combination of workflow_dispatch inputs, path-based conditionals, reusable workflows called with workflow_call, and matrix include/exclude blocks that let you dial exactly what executes based on who triggered the run, what changed, and what the target environment is. A push to a feature branch with only docs/** changes should run spell-check and nothing else. A manual dispatch with deploy: true should run the full chain. The same YAML, scoped by context.

This kind of conditional orchestration is one layer of a broader automation stack. If you're thinking about where CI toggles fit alongside event-driven pipelines, webhook triggers, and cross-system workflows, the guide on Workflow Automation in 2026: n8n, Zapier, and Self-Hosted Pipelines covers the wider space and is worth reading alongside this one.

The Three Toggle Mechanisms Worth Understanding

The mechanism most developers reach for first — workflow_dispatch boolean inputs — is also the one most often misused. It gives you explicit, human-readable control over which stages run, and it works equally well when called via the GitHub API for programmatic triggers. The YAML is straightforward, but the if expression syntax trips people up because it's not quite standard boolean comparison:

on:
  workflow_dispatch:
    inputs:
      run_deploy:
        description: "Deploy to production after build"
        type: boolean
        default: false
      target_env:
        description: "Target environment"
        type: choice
        options:
          - staging
          - production
        default: staging

jobs:
  deploy:
    runs-on: ubuntu-latest
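    # In the inputs context a boolean input is a real boolean, so test it directly.
    # Comparing it to the string 'true' would always evaluate to false.
    if: inputs.run_deploy
    steps:
      - name: Deploy
        run: ./scripts/deploy.sh ${{ inputs.target_env }}

That string-vs-boolean distinction catches everyone once. The type: boolean declaration makes the UI render a checkbox, and the inputs context preserves the value as a real boolean. The older github.event.inputs context carries the same values converted to strings, so github.event.inputs.run_deploy == 'true' is correct there, while inputs.run_deploy == 'true' is always false: when the operand types differ, the expression engine coerces both sides to numbers, and the string 'true' becomes NaN. Use inputs.run_deploy (or inputs.run_deploy == true) with the inputs context. Also worth knowing: if the same workflow is triggered by push rather than workflow_dispatch, inputs.run_deploy is empty and therefore falsy, so every if: inputs.run_deploy gate silently skips. That's usually what you want, but only if you intended it.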
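The section mentions that the same inputs can be supplied programmatically through the GitHub API. As a minimal sketch of that path, the snippet below builds (but does not send) a request against the REST endpoint for creating a workflow dispatch event. The owner, repository, workflow file name, and token are placeholders, and encoding every input value as a string is an assumption that mirrors what the gh CLI sends with its -f flag.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, ref, inputs, token):
    """Build (but do not send) a workflow_dispatch REST request."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # Assumption: send every input as a string; booleans become "true"/"false".
    encoded = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in inputs.items()
    }
    body = json.dumps({"ref": ref, "inputs": encoded}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder values; nothing here is taken from a real repository.
request = build_dispatch_request(
    "myorg", "myapp", "ci.yml", "main",
    {"run_deploy": True, "target_env": "staging"},
    token="<token>",
)
# urllib.request.urlopen(request) would send it; left out so the sketch has no side effects.
```

Keeping the request construction separate from the send makes the payload easy to log or inspect before anything is dispatched.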

Path filters solve a different problem: not "should this stage run" but "should this workflow run at all given what changed." If your monorepo has a docs/ directory and a src/ directory, there's no reason a markdown edit should trigger your full build and test matrix. The paths filter handles this at the trigger level, before any job ever starts:

on:
  push:
    branches: [main, "release/**"]
    paths:
      - "src/**"
      - "packages/**"
      - ".github/workflows/ci.yml"  # workflow changes should always re-run
    # paths and paths-ignore cannot be combined on the same event.
    # The denylist alternative is shown commented out for reference:
    # paths-ignore:
    #   - "docs/**"
    #   - "*.md"
    #   - ".gitignore"

One gotcha: paths and paths-ignore are mutually exclusive on the same event. If you define both, GitHub rejects the workflow file as invalid rather than quietly picking one. Use paths with an allowlist or paths-ignore with a denylist, not both. If you need to include and exclude in a single filter, use paths with !-prefixed negative patterns. Also include the workflow file itself in your paths allowlist, otherwise a change to the CI config won't trigger a run, which makes debugging infuriating.
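To reason about which pushes a paths allowlist lets through, a small local model can help. The sketch below is a simplification, not GitHub's matcher: it handles only `*` (does not cross `/`) and `**` (crosses `/`), ignores negation and the other special characters, and the helper names are hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Translate a simplified path glob into a compiled regex.

    `**` matches any run of characters including `/`;
    `*` matches any run of characters except `/`;
    everything else is treated as a literal.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def would_trigger(changed_files, paths):
    """True when at least one changed file matches at least one allowlist pattern."""
    regexes = [glob_to_regex(p) for p in paths]
    return any(rx.fullmatch(f) for f in changed_files for rx in regexes)


ALLOWLIST = ["src/**", "packages/**", ".github/workflows/ci.yml"]

docs_only = would_trigger(["README.md", "docs/setup.md"], ALLOWLIST)
mixed = would_trigger(["docs/setup.md", "src/app/main.ts"], ALLOWLIST)
```

One thing the model makes visible: a pattern like `*.md` matches only files at the repository root, because `*` stops at `/`. Matching Markdown at any depth needs `**.md` or `**/*.md`.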

Reusable workflows via workflow_call are the mechanism that actually enables modular composition at scale. The pattern is an orchestrator workflow that conditionally calls sub-workflows based on job-level if conditions:

# .github/workflows/orchestrator.yml
on:
  workflow_dispatch:
    inputs:
      run_integration_tests:
        type: boolean
        default: false

jobs:
  build:
    uses: ./.github/workflows/build.yml
    # no condition — build always runs

  integration:
    needs: build
    # only call the expensive sub-workflow when explicitly requested
    if: inputs.run_integration_tests
    uses: ./.github/workflows/integration-tests.yml
    with:
      node_version: "20"
    secrets: inherit

  deploy:
    needs: [build, integration]
    # needs: integration — but integration might be skipped.
    # result == 'skipped' must be handled or this job never runs.
    if: |
      always() &&
      needs.build.result == 'success' &&
      (needs.integration.result == 'success' || needs.integration.result == 'skipped')
    uses: ./.github/workflows/deploy.yml
    secrets: inherit

That always() + skipped-result pattern in the deploy job is the part nobody documents. If a needs dependency was skipped, GitHub marks the dependent job as skipped too — automatically, without running your if condition. The only escape is wrapping the condition in always() and explicitly allowing 'skipped' as an acceptable upstream result. These three mechanisms also interact in ways that compound quickly: a path filter can suppress the trigger entirely so workflow_dispatch inputs never exist; a skipped upstream job breaks naive needs chains; and an orchestrator calling a reusable workflow can't inspect the sub-workflow's internal job results, only its overall conclusion. Document which toggles are active and why at the top of every orchestrator workflow — a three-line comment block saves an hour of trace-reading two weeks later.
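The skip-propagation rule is easier to internalize as a truth table. The following is a simplified model under stated assumptions: cancellation is ignored, upstream results are limited to success, failure, and skipped, and the function and parameter names are hypothetical. It mirrors the behavior described above, where a condition without a status function such as always() gets an implicit success() check.

```python
def job_will_run(needs, condition, has_status_function):
    """Simplified model of a job-level `if:` combined with `needs:`.

    needs: dict of upstream job name -> 'success' | 'failure' | 'skipped'
    condition: callable that receives `needs` and returns the explicit `if:` result
    has_status_function: True when the `if:` contains always(), failure(), etc.
    """
    if not has_status_function:
        # Implicit success(): every upstream job must have succeeded.
        if any(result != "success" for result in needs.values()):
            return False
    return bool(condition(needs))


def accepts_skipped(needs):
    """The deploy condition from the orchestrator, minus the always() call."""
    return needs["build"] == "success" and needs["integration"] in ("success", "skipped")


toggle_off = {"build": "success", "integration": "skipped"}
real_failure = {"build": "success", "integration": "failure"}

without_always = job_will_run(toggle_off, accepts_skipped, has_status_function=False)
with_always = job_will_run(toggle_off, accepts_skipped, has_status_function=True)
still_blocked = job_will_run(real_failure, accepts_skipped, has_status_function=True)
```

In this model the skipped-result check alone never gets a chance to pass, and adding always() does not weaken failure handling as long as the explicit result checks stay in place.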

Building the Toggle Layer: A Real Orchestrator Workflow

The most counter-intuitive part of building a toggle layer isn't the boolean inputs — it's what happens downstream when a job gets skipped. GitHub Actions' default needs: behavior treats a skipped upstream job the same as a failed one, which means your entire pipeline silently dies the moment you flip a toggle off. That gotcha alone has caused more broken pipelines than any misconfigured secret. Fix that first, build the rest around it.

Here's the full orchestrator skeleton. Four boolean inputs, all defaulting to false so a bare workflow_dispatch trigger does nothing unless you explicitly opt in:

name: Orchestrator

on:
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      run_lint:
        description: "Run linting"
        type: boolean
        default: false
      run_test:
        description: "Run tests"
        type: boolean
        default: false
      run_build:
        description: "Run build"
        type: boolean
        default: false
      run_deploy:
        description: "Run deployment"
        type: boolean
        default: false

jobs:
  lint:
    # On push, always lint. On manual dispatch, only if toggled on.
    if: github.event_name == 'push' || inputs.run_lint
    uses: ./.github/workflows/lint.yml
    secrets: inherit

  test:
    needs: lint
    # On push, always run. On manual dispatch, only if toggled on.
    # always() switches off the implicit success() check that would
    # auto-skip this job when lint was intentionally skipped; the explicit
    # result check keeps a real lint failure blocking.
    if: |
      always() &&
      (github.event_name == 'push' || inputs.run_test) &&
      (needs.lint.result == 'success' || needs.lint.result == 'skipped')
    uses: ./.github/workflows/test.yml
    secrets: inherit

  build:
    needs: [lint, test]
    if: |
      always() &&
      (github.event_name == 'push' || inputs.run_build) &&
      (needs.lint.result == 'success' || needs.lint.result == 'skipped') &&
      (needs.test.result == 'success' || needs.test.result == 'skipped')
    uses: ./.github/workflows/build.yml
    secrets: inherit

  deploy:
    # List every ancestor: a job skipped because its upstream failed also
    # reports 'skipped', so checking build alone would hide a lint failure.
    needs: [lint, test, build]
    if: |
      always() &&
      (github.event_name == 'push' || inputs.run_deploy) &&
      (needs.lint.result == 'success' || needs.lint.result == 'skipped') &&
      (needs.test.result == 'success' || needs.test.result == 'skipped') &&
      (needs.build.result == 'success' || needs.build.result == 'skipped')
    uses: ./.github/workflows/deploy.yml
    secrets: inherit

The github.event_name == 'push' || inputs.run_lint pattern does real work here. On a push to main, all four stages run unconditionally — the toggles are irrelevant. On a manual workflow_dispatch, each stage is opt-in. One YAML file covers both the automated path and the surgical "just redeploy" path a developer needs at 11pm. The alternative — separate YAML files for manual vs automated — sounds clean until you have to keep them in sync across three months of changes.

The needs: chaining deserves more attention than the docs give it. When you write needs: lint and lint gets skipped, GitHub skips the downstream job too, because every if: carries an implicit success() check unless the expression contains a status function such as always(). Checking needs.lint.result on its own is therefore not enough. The pattern that actually works pairs always() with explicit result checks:

# always() disables the implicit success() check; the result comparison
# then decides: the job ran and passed, or it was skipped by a toggle.
if: |
  always() &&
  (needs.lint.result == 'success' || needs.lint.result == 'skipped')

Do not use a bare if: always() as a blanket fix here. On its own, always() runs the downstream job even when the upstream genuinely failed, which destroys your failure signal. The needs.X.result checks are what keep it surgical: they unblock the toggle path while preserving real failure propagation, because a real failure returns 'failure', not 'skipped'. One caveat: a job that was skipped because its own upstream failed also reports 'skipped', so a job late in the chain should list every ancestor in needs and check each result. For the deploy job specifically, be strict: if lint, test, or build failed, deploy should never run regardless of toggles.

Reusable Workflows as the Toggle Target: Structure and Secrets Passing

The part most teams get wrong first: they treat reusable workflows like shared bash scripts — useful, but dumb. The workflow_call trigger makes them something closer to typed function signatures. When you define inputs and secrets blocks directly in the called workflow file, that file becomes self-documenting in a way that a raw composite action or a script-with-args pattern never quite achieves. The orchestrator that calls it is forced to satisfy a contract. That's the actual value — not code reuse, but enforced interface boundaries between your pipeline stages.

Here's a minimal deploy.yml that takes the target environment as an input and declares the one secret it needs. workflow_call inputs support only the string, number, and boolean types, so the choice type used for workflow_dispatch isn't available here:

# .github/workflows/deploy.yml
on:
  workflow_call:
    inputs:
      environment:
        description: "Target deployment environment (staging or production)"
        type: string
        required: true
    # Declare the secrets this workflow expects. The caller satisfies them
    # either by mapping each one explicitly or with secrets: inherit, which
    # belongs on the calling job, not in this file. inherit is the shortcut:
    # use it during early development or when the secret set is stable and
    # well-understood. Switch to explicit mapping before the workflow is
    # called from more than one orchestrator with different secret scopes.
    secrets:
      DEPLOY_KEY:
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: actions/checkout@v4
      - name: Deploy
        run: ./scripts/deploy.sh
        env:
          # Supplied by the caller, explicitly or through secrets: inherit
          DEPLOY_KEY: ${{ secrets.DEPLOY_KEY }}
          TARGET_ENV: ${{ inputs.environment }}

The secrets: inherit shortcut is convenient but carries a real tradeoff: it implicitly grants the called workflow access to every secret visible in the calling context, including ones it has no business touching. If your orchestrator has PROD_DATABASE_URL and STRIPE_SECRET_KEY in scope and you call a build workflow that only needs REGISTRY_TOKEN, you've silently over-permissioned it. Explicit secret mapping is more verbose but documents exactly what's being passed:

# in the orchestrator (caller):
jobs:
  call-deploy:
    uses: ./.github/workflows/deploy.yml
    with:
      environment: production
    secrets:
      DEPLOY_KEY: ${{ secrets.DEPLOY_KEY }}
      # Only DEPLOY_KEY crosses the boundary. Nothing else.

The output mechanism is where reusable workflows stop feeling like isolated scripts and start behaving like pipeline stages with return values. A called workflow can surface data back to the orchestrator via workflow_call.outputs mapped from a job's outputs. The pattern that actually matters in practice: a build workflow that pushes a container image emits the immutable digest, and the deploy workflow consumes it — so you're never deploying "latest" by accident, you're deploying a specific SHA-pinned artifact.

# in build.yml (called workflow)
on:
  workflow_call:
    outputs:
      image_digest:
        description: "The pushed image digest (sha256:...)"
        value: ${{ jobs.build.outputs.digest }}

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      digest: ${{ steps.push.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Build and push
        id: push
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/myorg/myapp:${{ github.sha }}
      # docker/build-push-action emits `digest` as a step output automatically

# in orchestrator.yml (caller consuming the output)
jobs:
  build:
    uses: ./.github/workflows/build.yml

  deploy:
    needs: build
    uses: ./.github/workflows/deploy.yml
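    with:
      environment: staging
      # deploy.yml must declare image_digest under on.workflow_call.inputs,
      # otherwise the call is rejected as an undefined input.
      image_digest: ${{ needs.build.outputs.image_digest }}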
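On the consuming side, the point of passing a digest is that the deploy step refers to the image by content rather than by tag. A minimal sketch of that step follows, assuming a hypothetical helper that a deploy script might call; the repository name is the one used in the build example and the digest is a dummy value.

```python
import re

# A sha256 digest is the literal prefix plus 64 lowercase hex characters.
DIGEST_PATTERN = re.compile(r"sha256:[0-9a-f]{64}")


def pinned_image_ref(repository, digest):
    """Return an image reference pinned by digest, refusing anything tag-like."""
    if not DIGEST_PATTERN.fullmatch(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{repository}@{digest}"


example_digest = "sha256:" + "ab" * 32  # dummy value for illustration
image_ref = pinned_image_ref("ghcr.io/myorg/myapp", example_digest)
```

Rejecting anything that is not a full digest means an empty output from a skipped or misconfigured build fails loudly instead of silently falling back to a mutable tag.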

The nesting limit is the sharp edge that trips up anyone who tries to build a genuinely modular pipeline from reusable workflows. When reusable workflows first shipped, a called workflow could not call another reusable workflow, and a lot of older advice still says so. Nesting has been supported since 2022, but only to a fixed depth: the limit GitHub documented when nesting shipped was four levels counting the top-level caller, so check the current documentation for the number that applies to you. Secrets add a second constraint: secrets: inherit only reaches the directly called workflow, so in a chain A > B > C, workflow C sees a secret only if B passes it on. The practical consequence is that deep chains get hard to trace, and the orchestrator still ends up owning most of the coordination logic. If you find yourself wanting to compose called workflows several layers deep, the honest answer is usually to keep reusable workflows at the job-level orchestration boundary and use composite actions (which can also nest) for the leaf-level step logic.

Matrix Toggles: Scoping Environments and Platforms Dynamically

The most underused feature in GitHub Actions matrix configuration is treating the matrix itself as a runtime input rather than a static list baked into the workflow file. Instead of commenting out environments or maintaining a dozen nearly-identical workflow files, you can pass a JSON array through workflow_dispatch and let the matrix expand dynamically at runtime. The entry point looks deceptively simple:

on:
  workflow_dispatch:
    inputs:
      target_envs:
        description: 'JSON array of target environments'
        required: false
        default: '["staging"]'
        type: string

jobs:
  deploy:
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        env: ${{ fromJson(inputs.target_envs) }}
    steps:
      - name: Deploy to ${{ matrix.env }}
        run: ./scripts/deploy.sh --env ${{ matrix.env }}

The default of '["staging"]' matters: it means a bare trigger with no inputs filled in still produces a valid single-vector matrix rather than blowing up the job. When you need to hit production and canary simultaneously, you pass ["production","canary"] through the dispatch UI or the API payload. No branch gymnastics, no hardcoded job duplicates. The fromJson() call is doing the heavy lifting — it converts the string input into an actual array that the matrix engine can iterate over.

Once you have dynamic environments, you'll want certain steps to fire only for specific ones. A common pattern: smoke tests should only run against production because staging data is inconsistent and you don't want false alarms blocking the pipeline. With a dynamic matrix, the safe way to express that is a step-level guard on matrix.env:

jobs:
  deploy:
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        env: ${{ fromJson(inputs.target_envs) }}
    steps:
      - name: Deploy
        run: ./scripts/deploy.sh --env ${{ matrix.env }}

      - name: Smoke Tests
        # Only production gets smoke tests; every other env skips this step.
        if: matrix.env == 'production'
        run: ./scripts/smoke.sh --env ${{ matrix.env }}

It is tempting to use matrix.include as a lookup table instead, with static entries such as env: production, run_smoke: true. That works for a static matrix, but not for a dynamic one: an include entry whose env value is not in the input array cannot attach to any existing combination, so GitHub adds it as a new combination. A dispatch for ["staging"] would then also spawn a production job. If you need richer per-environment config, pass an array of objects through the input and feed it to include via fromJson, so that only the requested environments exist. Whichever form you use, write flag guards as == true rather than != false: a flag that was never set is empty, not false, and != false would run the step for any environment that lacks the flag.
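When include is combined with a dynamic env vector, it is worth checking what happens to include entries whose env is not in the input. The sketch below models the include rules as GitHub's matrix documentation describes them (an entry attaches to every original combination it does not overwrite, otherwise it becomes a new combination). It is a local model with hypothetical function names, not the runner's implementation, and it omits exclude.

```python
from itertools import product


def expand_matrix(vectors, include):
    """Expand a matrix and apply `include` entries.

    vectors: dict of key -> list of values (the original matrix)
    include: list of dicts to attach or append
    """
    originals = [dict(zip(vectors, values)) for values in product(*vectors.values())]
    combos = [dict(original) for original in originals]
    extras = []
    for entry in include:
        attached = False
        for original, combo in zip(originals, combos):
            # An entry may add keys, but must not change an original matrix value.
            if all(original.get(key, value) == value for key, value in entry.items()):
                combo.update(entry)
                attached = True
        if not attached:
            # Nothing matched: the entry becomes its own matrix combination.
            extras.append(dict(entry))
    return combos + extras


INCLUDE = [
    {"env": "production", "run_smoke": True},
    {"env": "staging", "run_smoke": False},
]

staging_only = expand_matrix({"env": ["staging"]}, INCLUDE)
prod_and_canary = expand_matrix({"env": ["production", "canary"]}, INCLUDE)
```

In this model a staging-only dispatch expands to two jobs, one of them for production, which is the reason static include entries deserve caution next to a runtime-supplied vector.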

Now the sharp edge: pass an empty array ([]) and the job fails at matrix evaluation with an error along the lines of "Matrix vector 'env' does not contain any values". There is no graceful skip, just a red job. This happens more than you'd expect when upstream logic generates the input programmatically and produces an empty result set. The fix is a job-level guard that runs before the matrix ever expands:

jobs:
  # Gate job: check if the matrix input is non-empty before expanding
  matrix-guard:
    runs-on: ubuntu-22.04
    outputs:
      has_targets: ${{ steps.check.outputs.has_targets }}
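    steps:
      - id: check
        env:
          # Pass the input through env instead of interpolating it into the
          # script, so a crafted value cannot inject shell commands.
          TARGET_ENVS: ${{ inputs.target_envs }}
        run: |
          COUNT=$(jq 'length' <<< "$TARGET_ENVS")
          if [ "$COUNT" -gt 0 ]; then
            echo "has_targets=true" >> "$GITHUB_OUTPUT"
          else
            echo "has_targets=false" >> "$GITHUB_OUTPUT"
          fi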

  deploy:
    needs: matrix-guard
    if: needs.matrix-guard.outputs.has_targets == 'true'
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        env: ${{ fromJson(inputs.target_envs) }}
    steps:
      - name: Deploy to ${{ matrix.env }}
        run: ./scripts/deploy.sh --env ${{ matrix.env }}

The matrix-guard job uses jq's length to count array elements and publishes the result as the string 'true' or 'false' (job outputs are always strings, hence the == 'true' comparison). The deploy job's if condition reads that output via needs.matrix-guard.outputs.has_targets and skips the entire job, including matrix expansion, when the array is empty. The job then shows as skipped rather than failed, which is the correct behavior when "no environments selected" is a valid state, not an error condition.
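When the input is generated programmatically, it can also be checked before the dispatch is ever sent. Here is a minimal sketch of such a pre-flight check; the allowed environment names and the function name are assumptions for illustration, not part of the workflow above.

```python
import json

# Assumption: the environments this pipeline knows how to deploy to.
ALLOWED_ENVS = {"staging", "production", "canary"}


def parse_target_envs(raw):
    """Return the list of environments, or raise ValueError with a readable reason."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"target_envs is not valid JSON: {exc.msg}") from exc
    if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
        raise ValueError("target_envs must be a JSON array of strings")
    unknown = sorted(set(value) - ALLOWED_ENVS)
    if unknown:
        raise ValueError(f"unknown environments: {unknown}")
    return value


targets = parse_target_envs('["production", "canary"]')
has_targets = len(parse_target_envs("[]")) > 0
```

An empty array parses cleanly and simply yields has_targets as False, which keeps "nothing selected" distinct from "malformed input".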

Self-Hosted Runner Considerations When Using Toggles

The thermal angle gets overlooked almost entirely in CI documentation. On a self-hosted runner that shares hardware with other services — say, an Ollama inference server or an n8n Docker stack — a job that runs but does nothing still consumes queue slots, spins up the runner process, and generates heat from the CPU doing bookkeeping. A toggle that skips the job entirely at the workflow level means the runner never receives the job assignment. That's not a minor optimization; on a machine where the GPU is already warm from LLM inference, preventing an unnecessary CUDA context initialization from a build job genuinely matters.

Tagging Runners by Capability, Not Just "Self-Hosted"

The default advice (slap runs-on: self-hosted on everything) causes real pain once you have more than one runner or more than one job type. The fix is capability labels. When you register a runner, you assign labels during setup or via the GitHub UI. A GPU-equipped machine gets [self-hosted, gpu, linux]. A lightweight runner on a cheap VPS gets [self-hosted, light, linux]. Then your workflow targets labels with intent:

jobs:
  docs-lint:
    # Docs linting has zero reason to touch the GPU runner.
    # The GPU machine doesn't carry the light label, so it can't pick this up.
    runs-on: [self-hosted, light, linux]
    if: ${{ inputs.run_docs_lint }}
    steps:
      - uses: actions/checkout@v4
      - run: npx markdownlint-cli2 "**/*.md"

  model-eval:
    # Only dispatched to the machine that actually has CUDA available.
    runs-on: [self-hosted, gpu, linux]
    if: ${{ inputs.run_model_eval }}
    steps:
      - uses: actions/checkout@v4
      - run: python eval/run_benchmarks.py

A docs-lint job will never queue on your GPU runner because GitHub Actions only dispatches a job to a runner that carries every label listed in runs-on, and the GPU machine lacks light. The distinguishing label is what does the work: a plain [self-hosted, linux] would match the GPU machine too, since its labels are a superset. This is the label-as-capability-gate pattern: you're not just routing, you're enforcing resource access at the dispatch layer. Combined with a toggle that skips model-eval entirely when you're pushing a docs change, the GPU runner stays free for the work that actually needs it.

Concurrency Controls on Top of Toggles

Toggles prevent unneeded types of jobs from running. Concurrency controls prevent the same job from stacking when someone triggers workflow_dispatch twice in a row, or when a push and a manual dispatch land close together. These two concerns are orthogonal and both need to be in the config. The group key should encode both the branch ref and the specific toggle inputs so that a deploy-enabled dispatch doesn't cancel a non-deploy dispatch on the same branch:

jobs:
  deploy:
    runs-on: [self-hosted, linux]
    if: ${{ inputs.run_deploy }}
    concurrency:
      # Group key ties together: the workflow name, the branch, and whether deploy is on.
      # Two deploy runs on main will cancel-and-replace each other.
      # A deploy run and a lint-only run are in different groups and don't interfere.
      group: ${{ github.workflow }}-${{ github.ref }}-deploy-${{ inputs.run_deploy }}
      cancel-in-progress: true
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh

Without inputs.run_deploy in the group key, a lint-only run (where run_deploy is false) and a full deploy run can cancel each other out — which is not what you want. The group key needs to be specific enough that only genuinely redundant runs collapse.

The Idle Timeout Phantom Offline Problem

This one doesn't appear in the GitHub Actions runner docs prominently, but you'll hit it within a week of running toggles on a self-hosted setup managed by PM2 or systemd. When a skipped job ends immediately — because every step is gated by an if condition that evaluates false — the runner completes the job in seconds and goes idle. If your runner's idle timeout is shorter than the interval between jobs, the GitHub Actions service marks the runner as offline. You then see phantom "offline" states in the Settings → Actions → Runners panel, and queued jobs fail to dispatch.

Two practical fixes. First, use the ACTIONS_RUNNER_HOOK_JOB_COMPLETED environment variable to fire a keepalive or logging script at the end of every job — this resets the runner's internal idle clock. Second, and simpler: bump the idle timeout in the runner's .env file before starting the service:

# In the runner's root directory: .env (create if not present)
# Treat RUNNER_IDLE_TIMEOUT as unverified: it does not appear among the
# runner settings GitHub documents, so test it on your runner version
# before relying on it.
RUNNER_IDLE_TIMEOUT=300

# If using the job-completed hook instead, point it at a no-op keepalive:
ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/opt/actions-runner/hooks/keepalive.sh

#!/bin/bash
# /opt/actions-runner/hooks/keepalive.sh
# Called by the runner after every job completes.
# Exists purely to produce a log line and reset the idle clock.
echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] job completed hook fired, runner staying warm"

The hook approach is more surgical — it doesn't change idle behavior globally, just ensures the runner signals activity after every job regardless of how fast it finished. On a PM2-managed runner, make sure the hook script is executable and the PM2 process has the env variable set in its ecosystem config, not just in the shell where you ran pm2 start. Environment variables set in the shell are not inherited by PM2 processes started at boot unless they're declared in ecosystem.config.js or the .env file the runner process reads on startup.

When This Pattern Breaks Down (and What to Use Instead)

The silent green run is the most dangerous failure mode in toggle-based workflows. A new contributor runs gh workflow run ci.yml accepting all defaults, every input defaults to false, the workflow completes in 8 seconds with a green checkmark, and nothing actually ran. The fix is a required summary job that always executes and explicitly asserts that at least one stage was enabled — treat it like a guard clause, not an afterthought:

sanity-check:
  runs-on: ubuntu-latest
  needs: [lint, test, build, deploy]
  if: always()
  steps:
    - name: Assert at least one stage ran
      run: |
        # Fails the workflow if every toggle was false
        if [[ "${{ inputs.run_lint }}" == "false" && \
              "${{ inputs.run_test }}" == "false" && \
              "${{ inputs.run_build }}" == "false" && \
              "${{ inputs.run_deploy }}" == "false" ]]; then
          echo "::error::All stages disabled — this run did nothing. Enable at least one toggle."
          exit 1
        fi

For monorepos, toggles are often the wrong tool entirely. If what you actually need is "run the backend job only when files under services/api/ changed," you're modeling a path-scoping problem as an input problem. Native paths: triggers in GitHub Actions apply at the workflow level, not the job level — you can't say "trigger this job but not that one based on which files changed." The third-party action dorny/paths-filter fills that gap by outputting per-path boolean signals you can consume in job if: conditions. The tradeoff is a real one: you're adding a dependency on a third-party action with its own release cadence, and you're coupling your job graph to a paths-filter step that must run first. Worth it for a true monorepo with three or more distinct service roots; not worth it if your "monorepo" is two packages sharing a lockfile.

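# dorny/paths-filter approach: jobs respond to actual file changes
jobs:
  changes:
    runs-on: ubuntu-latest
    # Re-export the step outputs so downstream jobs can read them via needs.
    outputs:
      api: ${{ steps.filter.outputs.api }}
      frontend: ${{ steps.filter.outputs.frontend }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            api:
              - 'services/api/**'
            frontend:
              - 'apps/web/**'

  # Then in a downstream job:
  build-api:
    needs: changes
    # Outputs are strings, so compare against 'true'.
    if: needs.changes.outputs.api == 'true'
    runs-on: ubuntu-latest
    steps:
      - run: echo "build the API here"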

A more subtle signal that the pattern is collapsing: you've written a script to decide which toggle values to pass to the workflow. The moment decision logic lives inside a shell script or composite action that computes inputs for another workflow, the workflow is no longer the source of truth — it's just an executor with a confusing interface. That logic belongs in an orchestrator. On my setup, this lands in n8n: a webhook receives the trigger event, a Function node evaluates branch name, changed paths, PR labels, and environment targets, then calls the GitHub API to dispatch the workflow with explicit inputs already resolved. The workflow itself stays dumb — it receives concrete boolean inputs and executes exactly what it's told. A small Node.js script via the Octokit REST client works equally well if n8n feels heavy for the use case.
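The decision logic that lives in the external orchestrator can stay small and testable. The following sketch shows one possible shape for it: a pure function that maps event facts to concrete toggle values before the dispatch call. The rules themselves (what counts as docs-only, the skip-deploy label, deploying only from main) are hypothetical and stand in for whatever policy the n8n Function node or script would encode.

```python
def resolve_toggles(branch, changed_paths, labels):
    """Map event facts to explicit workflow inputs (hypothetical rules)."""
    has_changes = bool(changed_paths)
    docs_only = has_changes and all(
        path.startswith("docs/") or path.endswith(".md") for path in changed_paths
    )
    code_changed = has_changes and not docs_only
    return {
        "run_lint": has_changes,
        "run_test": code_changed,
        "run_build": code_changed,
        # Deploy only from main, and only when nobody opted out via label.
        "run_deploy": code_changed and branch == "main" and "skip-deploy" not in labels,
    }


docs_push = resolve_toggles("feature/readme", ["docs/intro.md", "README.md"], [])
main_push = resolve_toggles("main", ["src/app.py"], [])
held_back = resolve_toggles("main", ["src/app.py"], ["skip-deploy"])
```

Because the function returns fully resolved booleans, the workflow on the receiving end needs no knowledge of branches, paths, or labels.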

There's also a hard architectural ceiling: GitHub caps the number of workflow_dispatch inputs per workflow. The documented limit was 10 for years and GitHub has since raised it, so check the current docs for the exact number. Either way, if you're at 8 toggles and considering adding more, that's not a configuration problem; it's a signal the workflow has accumulated too many responsibilities. The right move at that boundary is decomposition: split into multiple focused workflows (ci-quality.yml, ci-build.yml, ci-deploy.yml) called via workflow_call, each with its own narrow input surface. The toggle pattern works well in the range of 3–6 inputs for a single workflow; beyond that, you're managing complexity with more complexity.


Originally published on techdigestor.com.

Self-Hosted Log Fingerprinting Pipelines: Catching Anomalies Without Sending Logs to the Cloud

우병수 — Fri, 10 Jul 2026 08:10:41 +0000

TL;DR: Threshold-based alerting has a fundamental blind spot: it measures magnitude, not behavior. A sudden flood of HTTP 200s looks healthy to every rule you've probably written — no error rate spike, no latency breach, nothing trips.

📖 Reading time: ~24 min

What's in this article

The Problem: Your Logs Are Noisy and Your Alerting Is Dumb
Tool Stack and Why Each Piece Earns Its Place
Building the Pipeline: From Raw Log Line to Stored Fingerprint
Storing and Comparing Fingerprints in Qdrant
Wiring the Alert Path in n8n
Gotchas That Will Cost You an Afternoon
When This Architecture Is and Isn't the Right Call

The Problem: Your Logs Are Noisy and Your Alerting Is Dumb

Threshold-based alerting has a fundamental blind spot: it measures magnitude, not behavior. A sudden flood of HTTP 200s looks healthy to every rule you've probably written — no error rate spike, no latency breach, nothing trips. But if those 200s are all hitting /api/export from twenty IPs that appeared six hours ago, you have a data exfiltration pattern that sailed right past your PagerDuty rules. The same failure mode shows up in auth logs, queue workers, batch jobs — anywhere the shape of traffic matters more than the raw count. Threshold alerting asks "how many?" when the real question is "does this look like something we've seen before?"

The SaaS log aggregation pitch is compelling until you work out the economics at volume. Shipping gigabytes of application logs to Datadog or Splunk means paying ingestion costs that scale linearly with your verbosity, and those platforms reward you for logging less — which is exactly backwards from good observability practice. Beyond cost, there's the data residency question. Your logs contain internal hostnames, user IDs, API response bodies, stack traces with file paths, and whatever else your developers decided to dump into structured fields. That data leaving the building is a compliance surface you may not have formally assessed. The "just send everything to the cloud" approach trades operational convenience for a data exposure posture that's genuinely hard to audit.

Log fingerprinting takes a different angle entirely. Instead of writing regex patterns for every known error format — a game you will lose as soon as a library version changes its error messages — you cluster structurally similar log lines together and track the cluster space over time. A new cluster appearing is interesting. A known cluster whose volume doubles in ten minutes is interesting. A cluster that disappears entirely might mean a service stopped doing something it should be doing. None of this requires you to have predicted the specific log message in advance. The clusters emerge from the data, and deviations from the established cluster topology are what trigger alerts. This is why it beats regex whack-a-mole: you're not pattern-matching strings, you're modeling normal log behavior and flagging structural drift.

The pipeline that makes this work locally has four sequential stages. Raw log lines come in and get parsed into structured fields — timestamp, level, service, message body. The message body gets embedded into a dense vector using something like bge-m3, which handles the semantic similarity work that makes "connection refused" and "failed to connect" land near each other in vector space. Those vectors get clustered — HDBSCAN is a solid choice here because it doesn't require you to specify the number of clusters upfront and handles noise points well. Then each incoming batch of embedded vectors gets compared against the known cluster map: do these points fall into existing clusters, or are they carving out new territory? The alert layer only fires on that last question. The whole thing runs on local hardware, your logs never leave the machine, and the operational cost is VRAM and a Postgres instance, not a per-GB ingestion bill.

Tool Stack and Why Each Piece Earns Its Place

The most common mistake in log fingerprinting pipelines is over-engineering the ingestion layer. Elasticsearch clusters are expensive to run, operationally heavy, and genuinely overkill for fingerprinting workloads where you're not doing full-text search across petabytes — you're chunking log lines, embedding them, and doing nearest-neighbor lookups. Grafana Loki costs a fraction of that operationally: it indexes labels, not content, which means disk usage stays sane and memory pressure stays low. Promtail's Docker log scraping drops into a compose file without custom configuration gymnastics — point it at the Docker socket, define your label mappings, done.

# promtail-config.yml — minimal Docker log scraping
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # use container name as the service label for LogQL filtering
      - source_labels: [__meta_docker_container_name]
        target_label: service
      - source_labels: [__meta_docker_container_log_stream]
        target_label: stream

On the embedding side, bge-m3 via Ollama is the right call for logs specifically because logs are multilingual garbage — stack traces mixed with German error messages, base64 blobs, hex addresses, vendor-specific tokens. Most embedding models fall apart on that. bge-m3's 1024-dimensional output handles the noise better than smaller models, and on a 32GB VRAM box it loads alongside a quantized inference model without eviction. The VRAM footprint for bge-m3 in Ollama sits around 1.2–1.5 GB depending on quantization, which is pocket change if you're already running a Q4 Mixtral or similar. The critical thing: you're embedding log templates, not raw lines — strip the timestamps and variable fields first, or your vector space becomes useless.

Qdrant earns its place because the operational model matches exactly what log fingerprinting needs. Named collections per service mean your web app logs and your database logs don't pollute each other's similarity space: a connection timeout in Postgres and a 502 in Nginx have different cluster geometries. The HNSW index gives you approximate nearest-neighbor search at latency that doesn't bottleneck the pipeline; on a local Docker deployment, vector searches with payload come back well under 10ms for collections in the hundreds of thousands of points. One real gotcha: unless on_disk_payload is enabled, payloads are held in RAM. If you're storing full log line text as payload for retrieval, set on_disk_payload: true explicitly rather than relying on the default, or RAM usage grows faster than expected.

# Create a named collection for a specific service
curl -X PUT http://localhost:6333/collections/nginx_logs \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 1024,
      "distance": "Cosine"
    },
    "on_disk_payload": true,
    "hnsw_config": {
      "m": 16,
      "ef_construct": 100
    }
  }'

n8n is where this stack avoids becoming a custom microservice project. The HTTP Request node hits the Ollama embedding API, another hits Qdrant's upsert endpoint, and the whole flow triggers either from a Loki webhook alert or a cron that polls LogQL for new log volume. No custom server code, no maintaining a Python service, no dependency hell — the glue layer is just JSON flowing between HTTP calls. The honest trade-off: n8n's execution model isn't designed for high-frequency streaming; if you're trying to fingerprint thousands of log lines per second in real time, you'll hit the webhook queue ceiling and need something like a dedicated consumer. For batch fingerprinting on a 5-minute cron or alert-driven triage, it handles the load without complaint.

Building the Pipeline: From Raw Log Line to Stored Fingerprint

The part most tutorials skip entirely: Loki ingestion without a sane label strategy produces a query surface so noisy that your downstream fingerprinting collapses into garbage clusters. Before touching embeddings, the labels you attach at scrape time determine whether your LogQL queries return coherent batches or mixed-signal soup. Get that wrong and every downstream fix is harder than it needs to be.

Step 1 — Promtail Config and Label Strategy

The minimal working Promtail config for Docker container logs uses the Docker socket to autodiscover containers, then applies relabeling to extract the labels that matter for fingerprinting. The three you actually need are job, container_name, and level. Everything else is noise in Loki's index unless you have a specific query need for it.

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # container_name becomes the primary identity label
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container_name
      # job groups containers by logical service (set via Docker label)
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: job
    pipeline_stages:
      - regex:
          # pull level out of structured JSON logs; adjust pattern for your format
          expression: '.*"level":"(?P<level>[a-zA-Z]+)".*'
      - labels:
          level:

One gotcha that burned me early: Promtail's Docker SD needs the Docker socket mounted into the container (read-only is enough), and the Promtail user needs permission to read it. If you run Promtail rootless, permission errors on /var/run/docker.sock are easy to miss: Promtail starts, reports nothing obvious, and just never scrapes anything. Grant access by adding the socket's group, either with --group-add $(stat -c '%g' /var/run/docker.sock) in your run command or with the same GID under group_add in the Compose file. Also worth knowing: high-cardinality labels like trace_id or request_id will crater Loki's index performance fast. Keep the label set to those three unless you have a concrete reason to expand it.

Step 2 — Pulling Log Batches via LogQL HTTP API

Once logs are in Loki, the fingerprinting pipeline pulls batches on a cron cycle. The query_range endpoint is what you want — not query, which is for instant queries. The limit parameter caps lines per response, and start/end accept nanosecond epoch timestamps.

# Pull last 5 minutes of app logs, 200 lines max
curl 'http://loki:3100/loki/api/v1/query_range?query=\{job="app"\}&limit=200&start='"$(date -d '5 minutes ago' +%s%N)"'&end='"$(date +%s%N)"'&direction=forward'

The response is a JSON envelope where actual log lines live at .data.result[*].values[*][1] — the second element of each [timestamp, line] pair. In Node.js, extract with results.flatMap(s => s.values.map(v => v[1])). A common failure mode is hitting Loki's default query_timeout of 1 minute when limit is large and the label selector isn't selective enough. If your query spans multiple high-volume containers without a level filter, add one: {job="app", level="error"} cuts the result set dramatically and the fingerprints you actually care about are usually errors and warnings anyway.
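For pipelines that run this step outside n8n, the request parameters and the extraction path can be sketched in Python. This is a minimal sketch under stated assumptions: `build_query_params` and `extract_lines` are illustrative helper names, not part of any Loki client library, and the sample response is hand-written and trimmed to the fields used here.

```python
import time


def build_query_params(selector: str, window_s: int = 300, limit: int = 200) -> dict:
    """Build query_range parameters covering the last `window_s` seconds."""
    end_ns = time.time_ns()                        # nanosecond epoch timestamps
    start_ns = end_ns - window_s * 1_000_000_000
    return {
        "query": selector,
        "limit": str(limit),
        "start": str(start_ns),
        "end": str(end_ns),
        "direction": "forward",
    }


def extract_lines(response: dict) -> list[str]:
    """Flatten .data.result[*].values[*][1] into a flat list of log lines."""
    lines = []
    for stream in response.get("data", {}).get("result", []):
        for _timestamp, line in stream.get("values", []):
            lines.append(line)
    return lines


# Hand-written sample in the shape of a query_range response
sample = {
    "status": "success",
    "data": {
        "resultType": "streams",
        "result": [
            {"stream": {"job": "app"}, "values": [["1700000000000000000", "GET /health 200"]]},
            {"stream": {"job": "app"}, "values": [["1700000001000000000", "connection refused"]]},
        ],
    },
}
params = build_query_params('{job="app", level="error"}')
print(extract_lines(sample))
```

Passing `params` through a URL encoder (for example `urllib.parse.urlencode`) avoids the manual brace and quote escaping that the raw curl URL needs.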

Step 3 — Normalizing Log Lines Before Embedding

This is the step that makes or breaks fingerprint quality. Two log lines that are semantically identical (same error, different request ID) must land in the same embedding neighborhood. Without normalization, bge-m3 treats them as distinct because they literally differ in token content. The transform replaces the high-entropy variable parts and leaves the structural template.

// normalize.js: run each raw line through this before embedding
function normalizeLine(line) {
  return line
    // strip leading ISO timestamp (2024-01-15T12:34:56.789Z or similar)
    .replace(/^\d{4}-\d{2}-\d{2}T[\d:.]+Z?\s*/, '')
    // replace UUIDs (8-4-4-4-12 hex pattern) with a placeholder token
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '<UUID>')
    // replace IPv4 addresses with a placeholder so the structure stays visible
    .replace(/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g, '<IP>')
    // replace standalone numeric IDs (≥4 digits to avoid catching HTTP status codes)
    .replace(/\b\d{4,}\b/g, '<NUM>')
    // collapse repeated whitespace
    .replace(/\s+/g, ' ')
    .trim();
}

module.exports = { normalizeLine };

The threshold for numeric stripping matters: stripping all numbers kills useful signal like HTTP status codes (404, 500), which are exactly the kind of variance you want your fingerprints to capture as distinct clusters. That's why the pattern only touches numbers of four or more digits. Adjust that threshold based on what your logs actually contain: if your app logs database row counts that vary but the surrounding message is identical, you may want to go lower. After this transform, structurally identical lines from different requests collapse to the same string, so bge-m3 produces the same vector for them instead of scattering them by entropy.
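To check that two lines really do collapse to one template, a Python port of the same rules is handy in a test harness. This is a sketch, not the pipeline's code: the placeholder tokens (`<UUID>`, `<IP>`, `<NUM>`) and both sample log lines are assumptions made for illustration.

```python
import re

# Rules are applied in order; UUIDs go first so their digit groups
# are not caught by the numeric rule.
_RULES = [
    (re.compile(r"^\d{4}-\d{2}-\d{2}T[\d:.]+Z?\s*"), ""),   # leading ISO timestamp
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I), "<UUID>"),
    (re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"), "<IP>"),
    (re.compile(r"\b\d{4,}\b"), "<NUM>"),                    # 4+ digits; keeps 404 and 500
    (re.compile(r"\s+"), " "),                               # collapse whitespace
]


def normalize_line(line: str) -> str:
    for pattern, replacement in _RULES:
        line = pattern.sub(replacement, line)
    return line.strip()


a = normalize_line(
    "2024-01-15T12:34:56.789Z request 3f2b8c1e-9a4d-4c7b-8e21-5d6f7a8b9c0d "
    "from 10.0.0.12 failed with 500 after 12034 ms"
)
b = normalize_line(
    "2024-01-15T12:35:02.101Z request 7c9e6679-7425-40de-944b-e07fc1f90ae7 "
    "from 10.0.0.47 failed with 500 after 98211 ms"
)
print(a)
print(a == b)
```

Note that the word-boundary rule only matches standalone numbers: a value glued to its unit, such as `12034ms`, is left untouched, so such formats need their own rule.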

Step 4 — Batching to Ollama's Embeddings Endpoint

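bge-m3 via Ollama is served at the /api/embeddings endpoint, which takes a single input: one string per call. Batching here means N parallel requests or sequential chunked requests, not a true batch API call. (Newer Ollama releases add an /api/embed endpoint that accepts a list of inputs; the code below sticks with the single-input call.) On my 32GB VRAM box, sequential calls to bge-m3 run around 80–120ms each for typical log line lengths. For 200 lines that's 16–24 seconds sequential, which is fine for a 5-minute cron but eats a large share of a 1-minute window. The practical fix is chunks of 20 concurrent requests with Promise.all: 200 lines become 10 chunks processed one after another, and if each chunk completes in roughly 100ms that is about 1 second of wall time for 200 embeddings. Treat that as a best case, since it assumes the server actually runs the 20 requests in a chunk concurrently rather than queueing them.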

// embed-batch.js
const OLLAMA_URL = 'http://localhost:11434/api/embeddings';
const BATCH_SIZE = 20;

async function embedLine(line) {
  const res = await fetch(OLLAMA_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // bge-m3 is the model identifier as registered in Ollama
    body: JSON.stringify({ model: 'bge-m3', prompt: line }),
  });
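  // fail loudly instead of returning undefined on a non-2xx response
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.embedding; // array of 1024 numbers for bge-m3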
}

async function embedBatch(lines) {
  const results = [];
  // chunk into groups of BATCH_SIZE to avoid hammering the GPU queue
  for (let i = 0; i < lines.length; i += BATCH_SIZE) {
    const chunk = lines.slice(i, i + BATCH_SIZE);
    const embeddings = await Promise.all(chunk.map(embedLine));
    results.push(...embeddings);
  }
  return results; // parallel within chunk, serial across chunks
}

module.exports = { embedBatch };

One thing the Ollama docs don't emphasize enough: if you fire more than ~30 concurrent requests at a single bge-m3 instance, you may start seeing timeout errors from the HTTP layer before the GPU is even the bottleneck, because requests queue up in the server before they reach the model. Keeping parallel in-flight requests to 20 gives the queue room to breathe while still keeping the GPU busy. The resulting embeddings are 1024-dimensional float32 vectors, about 4KB each (1024 × 4 bytes). For 200 log lines per run that's roughly 800KB per cycle, small enough for any store: Qdrant, as used in the next section, or Postgres with the pgvector extension and a vector(1024) column.
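The storage arithmetic is worth writing down once, because it also gives an upper bound on daily growth. A minimal sketch, assuming float32 storage, the 200-line batch, and the 5-minute cron used in this article; it ignores index overhead and deduplication, both of which change the real number.

```python
DIMS = 1024              # bge-m3 output dimension
BYTES_PER_FLOAT32 = 4
LINES_PER_RUN = 200
CRON_MINUTES = 5

bytes_per_vector = DIMS * BYTES_PER_FLOAT32           # raw vector size
bytes_per_run = bytes_per_vector * LINES_PER_RUN      # one cron cycle
runs_per_day = 24 * 60 // CRON_MINUTES
bytes_per_day = bytes_per_run * runs_per_day          # if every line became a new point

print(f"{bytes_per_vector} bytes per vector")
print(f"{bytes_per_run / 1e6:.2f} MB per run")
print(f"{bytes_per_day / 1e6:.1f} MB per day (upper bound, no deduplication)")
```

With deterministic point IDs, recurring templates overwrite themselves, so actual growth tracks the number of distinct templates rather than the number of lines.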

Storing and Comparing Fingerprints in Qdrant

Most people reach for Elasticsearch when they need to store and query log patterns. That's fine if you already have it running, but Qdrant gives you something Elasticsearch doesn't: the similarity search is the primary operation, not an afterthought bolted onto a text index. For fingerprint storage specifically, that distinction matters — you're not keyword-searching logs, you're asking "how novel is this line relative to everything I've seen before?"

Collection Setup

One collection per major service is the right default. Cross-service similarity searches produce false negatives constantly — a database connection timeout and an HTTP gateway timeout can embed close together because the sentence structure is similar, even though they're completely unrelated failure modes. Keep them separate from day one; merging collections later is painful.

curl -X PUT http://localhost:6333/collections/log_fingerprints \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 1024,
      "distance": "Cosine"
    },
    "optimizers_config": {
      "memmap_threshold": 20000
    },
    "hnsw_config": {
      "m": 16,
      "ef_construct": 100
    }
  }'

The size: 1024 matches bge-m3's output dimension. If you're using a different embedding model, verify the output dimension before creating the collection, because Qdrant rejects upserts whose vectors don't match the declared size. The memmap_threshold setting is expressed in kilobytes of vector data per segment, not in vectors: at about 4KB per 1024-dimensional float32 vector, 20000 corresponds to roughly 5,000 vectors, after which Qdrant switches the segment to memory-mapped storage. That keeps RAM usage from ballooning on a busy service log collection.

Upsert Pattern: Hash as Point ID

The critical detail here is deterministic point IDs. If you let Qdrant generate random UUIDs, every re-occurrence of the same log template creates a new point, your collection bloats, and your novelty scores degrade because the same pattern now has dozens of near-duplicate neighbors. Instead, hash the normalized template and use that as the ID. Qdrant accepts unsigned 64-bit integers as point IDs, so truncate the SHA-256 to 8 bytes.

import hashlib
import time

from qdrant_client.models import PointStruct

def template_to_point_id(template: str) -> int:
    # Truncate SHA-256 to a 64-bit unsigned int. Collision risk is
    # acceptable here; false deduplication is less harmful than bloat
    digest = hashlib.sha256(template.encode()).digest()
    return int.from_bytes(digest[:8], byteorder='big')

def upsert_fingerprint(client, collection: str, template: str, vector: list[float]):
    point_id = template_to_point_id(template)
    # upsert semantics: an existing point is fully replaced, not merged,
    # so update hit_count with a separate set_payload call
    client.upsert(
        collection_name=collection,
        points=[
            PointStruct(
                id=point_id,
                vector=vector,
                payload={
                    "template": template,
                    "first_seen": int(time.time()),  # only meaningful on first insert
                    "hit_count": 1,                  # increment separately via set_payload
                },
            )
        ],
    )

The payload stores the raw template and a first_seen epoch timestamp. On a true upsert (the point already exists), Qdrant replaces the payload entirely — it does not merge. If you want to increment a hit counter without overwriting first_seen, use the set_payload endpoint after checking whether the point existed. It's an extra round-trip but it's the correct pattern; trying to handle this in one call leads to timestamp drift where recurring patterns appear newly seen every time they're updated.
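The read-then-write decision can be isolated as a pure function, which makes it easy to test before wiring it to Qdrant. A minimal sketch: `next_payload` is a hypothetical helper, and the `last_seen` field is an addition that is not part of the payload shown above. The caller would look up the stored payload first (for example with the client's retrieve call), then write the result with upsert for a new point or set_payload for an existing one.

```python
import time
from typing import Optional


def next_payload(existing: Optional[dict], template: str, now: Optional[int] = None) -> dict:
    """Return the payload to write, given the stored payload (or None if unseen)."""
    now = int(time.time()) if now is None else now
    if existing is None:
        # First sighting: first_seen is set exactly once
        return {"template": template, "first_seen": now, "last_seen": now, "hit_count": 1}
    return {
        "template": template,
        "first_seen": existing["first_seen"],           # never overwritten
        "last_seen": now,                               # refreshed on every recurrence
        "hit_count": existing.get("hit_count", 0) + 1,
    }


first = next_payload(None, "connection refused to <IP>", now=1_700_000_000)
again = next_payload(first, "connection refused to <IP>", now=1_700_000_300)
print(again)
```

Keeping a separate `last_seen` means recency filters can use it while `first_seen` stays a true record of when the pattern first appeared.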

Similarity Search for Novelty Detection

The actual anomaly detection query is a top-k search with a score threshold applied client-side. Qdrant's score_threshold parameter cuts off results below the value, but for novelty detection you want to inspect the scores yourself rather than rely on the cutoff — you need to know if zero results came back because the line is genuinely novel or because the collection is empty.

def is_novel(client, collection: str, vector: list[float], threshold: float = 0.88) -> bool:
    results = client.search(
        collection_name=collection,
        query_vector=vector,
        limit=3,
        with_payload=False,  # skip payload fetch — we only need scores here
    )
    if not results:
        # empty collection or empty result: treat as novel
        return True
    # novel if every neighbor is below threshold
    return all(hit.score < threshold for hit in results)

The 0.88 threshold is a starting point, not a derived constant. bge-m3 with Cosine similarity tends to score structurally similar log lines — same verb, different values — in the 0.91–0.97 range. Lines that share only domain vocabulary (both mention "connection") but differ in structure tend to fall into the 0.75–0.87 range. So 0.88 sits in a real gap for many log corpora, but your logs may have a different distribution. Run the search across a week of known-normal traffic, plot the score histogram, and pick the threshold at the valley between the high-similarity cluster and the low-similarity tail. That's more defensible than any fixed number.
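The "valley between two peaks" step can be automated once the scores are collected. The sketch below is one simple way to do it, not the only one: `pick_threshold` is a hypothetical helper, the bin count and the minimum peak separation of three bins are arbitrary choices, and the scores are synthetic values made up to exercise the function, not measurements.

```python
def pick_threshold(scores: list[float], lo: float = 0.5, hi: float = 1.0, bins: int = 25) -> float:
    """Return the centre of the emptiest histogram bin between the two tallest peaks."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in scores:
        if lo <= s <= hi:
            counts[min(int((s - lo) / width), bins - 1)] += 1

    # Tallest bin, then the tallest bin at least 3 bins away from it
    first = max(range(bins), key=counts.__getitem__)
    far = [i for i in range(bins) if abs(i - first) >= 3]
    second = max(far, key=counts.__getitem__)

    # The valley is the lowest-count bin strictly between the two peaks
    left, right = sorted((first, second))
    valley = min(range(left + 1, right), key=counts.__getitem__)
    return round(lo + (valley + 0.5) * width, 3)


# Synthetic scores for illustration only: a low-similarity tail near 0.81
# and a high-similarity cluster near 0.95
scores = ([0.79] * 5 + [0.81] * 20 + [0.83] * 6 + [0.85] * 2 + [0.87]
          + [0.91] * 3 + [0.93] * 10 + [0.95] * 40 + [0.97] * 12)
print(pick_threshold(scores))
```

The function assumes the distribution really is bimodal; on a single-peaked histogram it still returns a number, so plot the histogram before trusting the result.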

Payload Filtering to Avoid Stale Fingerprints

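Without a time filter, a six-month-old fingerprint of a since-fixed bug will suppress anomaly detection for any log line that resembles it. The fix is scoping similarity searches to a recent window via a payload filter. Qdrant applies the filter during the vector search rather than filtering the top-k results afterwards, so this doesn't add much overhead. One caveat: the timestamp you filter on has to be refreshed whenever a pattern recurs. If first_seen is truly preserved from the first insert, every long-lived pattern falls out of the window after 24 hours and starts looking novel again, so filter on a field that tracks the most recent sighting instead.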

import time

def search_recent(client, collection: str, vector: list[float], window_hours: int = 24):
    cutoff = int(time.time()) - (window_hours * 3600)
    return client.search(
        collection_name=collection,
        query_vector=vector,
        limit=3,
        query_filter={
            "must": [
                {
                    "key": "first_seen",
                    "range": {
                        "gte": cutoff
                    }
                }
            ]
        },
        with_payload=True,
    )

One non-obvious behavior: with a narrow window and a low-volume collection, you may get empty results for lines that genuinely appeared inside the window, because their first_seen timestamp lands just outside it due to clock skew between your log ingestion service and the machine running the upserts. Store timestamps in UTC epoch seconds everywhere and sync clocks. NTP drift of a few seconds is harmless; container time misconfiguration causing hours of offset is not, and it's more common than you'd expect in air-gapped or intermittently connected setups.

Wiring the Alert Path in n8n

The silent timeout is the first thing that will burn you. When the HTTP Request node in n8n calls Ollama for embeddings and the model is still warming up or the queue is deep, the request can time out without an obvious error in the flow: the workflow moves on, and your embedding comes back empty. The fix is explicit: open the HTTP Request node options, find Timeout, and set it to at least 10000 ms instead of relying on a default the UI doesn't show. Miss this and you'll spend an hour wondering why your similarity scores are all zero.

The overall wiring uses two entry points. A Schedule Trigger fires every 5 minutes, queries Loki via HTTP Request (LogQL, range query, last 5-minute window), then hands off to a sub-workflow that runs normalize → embed → compare. Keeping this as a sub-workflow rather than one giant flow means you can trigger the same logic from the second entry point: a Webhook node that receives Loki ruler alerts for immediate processing when a rule fires mid-window. Both paths converge on the same normalize-embed-compare sub-workflow, so there's no logic duplication.

The Ollama embedding node config looks like this:

// HTTP Request node — Ollama embeddings
Method: POST
URL: http://ollama:11434/api/embeddings
Body (JSON):
{
  "model": "bge-m3",
  "prompt": "{{$json.normalized_line}}"
}
Options:
  Timeout: 10000          // must be explicit — silent drop otherwise
  Response Format: JSON   // Ollama returns { "embedding": [...] }

After the embedding returns, an IF node checks the cosine similarity score produced by the compare step against your threshold. That value is tuned per log source: the 0.88 starting point from the previous section is one option, and something around 0.82 is a reasonable start for normalized syslog-style lines. The branching is binary: high-score (known pattern) upserts the vector into Qdrant and exits cleanly; low-score (novel or anomalous) either posts to a Slack webhook with the raw line and score attached, or writes a row to a Postgres table for human review. I use the Postgres path for anything that fires during off-hours: it accumulates without paging anyone, and a morning query against the review table surfaces clusters of related anomalies better than a pile of Slack messages.

-- Postgres review table schema
CREATE TABLE log_anomalies (
  id          SERIAL PRIMARY KEY,
  captured_at TIMESTAMPTZ DEFAULT now(),
  service     TEXT,
  normalized  TEXT,
  raw_line    TEXT,
  similarity  FLOAT4,
  reviewed    BOOLEAN DEFAULT false
);

-- morning triage query: cluster by normalized template
SELECT normalized, COUNT(*), MIN(similarity), MAX(captured_at)
FROM log_anomalies
WHERE reviewed = false
GROUP BY normalized
ORDER BY COUNT(*) DESC;

The rate-limiting reality hits hard on any service generating more than 500 log lines per 5-minute window. bge-m3 on my 32GB box handles roughly 80–120 embed calls per minute before latency climbs enough to blow the cron budget. Three practical escapes: increase the cron interval to 15 minutes and accept the detection lag; sample the incoming batch (every Nth line after deduplication); or, the best option, pre-deduplicate by normalized template before the embed step. If ten log lines normalize to the same template, you only need one embedding. A Code node that builds a deduplicated map keyed by template string before the HTTP Request loop (or the Remove Duplicates node, if you don't need per-template counts) cuts embed volume dramatically on noisy services and costs almost nothing in n8n processing time.
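The deduplication step itself is a few lines. Inside n8n it would live in a Code node written in JavaScript; the Python sketch below shows the same logic for pipelines that run outside n8n. The helper name `dedupe_templates` and the sample batch are illustrative.

```python
from collections import Counter


def dedupe_templates(normalized_lines: list[str]) -> Counter:
    """Map each distinct template to the number of lines that produced it."""
    return Counter(line for line in normalized_lines if line)


batch = [
    "connection refused to <IP>",
    "connection refused to <IP>",
    "GET /health 200",
    "connection refused to <IP>",
    "",                               # lines that normalize to nothing are dropped
]
templates = dedupe_templates(batch)
print(len(batch), "lines ->", len(templates), "embed calls")
for template, count in templates.most_common():
    print(count, template)
```

Only the keys go to the embedding endpoint; the counts can be carried along to update a hit counter in the fingerprint payload.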

Gotchas That Will Cost You an Afternoon

The one that burns people most reliably is input truncation. bge-m3 itself accepts up to 8192 tokens, but the effective limit is the context window your Ollama runtime applies to the model, which can be far smaller, and a raw log line with an attached Java stack trace can hit that ceiling fast. The model won't throw an error: the input is silently truncated, which means your embedding represents partial content and your similarity scores become garbage for anything longer than a short event line. The fix is to embed the normalized template, not the raw line. Strip the variable parts (timestamps, hex addresses, UUIDs, numeric IDs), take the first ~300 characters of what's left, and embed that. You get a stable, compact representation of the log shape rather than a noisy snapshot of one occurrence.

Qdrant keeps vectors and the HNSW index in memory by default. With a small collection this is invisible. Push past 500k vectors and you'll notice it the hard way: container restarts on your fingerprint store stall for tens of seconds while everything is loaded back into RAM. Two settings control this. on_disk: true in the vector config keeps the raw vectors on disk as memory-mapped files, and on_disk: true in hnsw_config does the same for the index graph. It is simplest to set both when you create the collection:

PUT /collections/log_fingerprints
{
  "vectors": {
    "size": 1024,
    "distance": "Cosine",
    "on_disk": true   // raw vectors are memory-mapped from disk instead of held in RAM
  },
  "hnsw_config": {
    "on_disk": true   // HNSW graph stays on disk; slower ANN but faster restarts
  }
}

The latency tradeoff is real — expect slightly slower nearest-neighbor queries — but for a fingerprinting pipeline that tolerates a few hundred milliseconds per lookup, it's the right call. The alternative is a Qdrant container that takes 40 seconds to become healthy every time you update it, which makes deployment windows painful.

n8n's HTTP Request node does not stream responses, and it does not know your GPU is busy. If Ollama is mid-load serving another model when your embedding call arrives, that request can sit for 30+ seconds. With Retry On Fail enabled, n8n will time out and retry, and now the same log chunk can reach Qdrant twice, which creates duplicate fingerprint entries if your upsert logic isn't idempotent. The fix isn't to increase the timeout (though you should). The fix is to generate a deterministic ID for each upsert: hash the normalized template string and derive the Qdrant point ID from it. A SHA-256 of the first 300 characters of the template gives you a stable key:

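import { createHash } from "node:crypto";

function fingerprintId(normalizedTemplate: string): string {
  // Using the template, not the raw line: same shape always maps to the same ID
  const hex = createHash("sha256")
    .update(normalizedTemplate.slice(0, 300))
    .digest("hex");
  // Qdrant point IDs must be unsigned 64-bit integers or UUIDs, not arbitrary
  // strings, so format the first 128 bits of the hash as a UUID-shaped string
  return [
    hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16),
    hex.slice(16, 20), hex.slice(20, 32),
  ].join("-");
}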

Loki will silently lie to you about completeness. When limit is omitted, query_range returns only 100 lines, and the server caps the value you can request at max_entries_limit_per_query (5000 by default). Either way, the API does not return a warning or a has_more flag when more lines matched than were returned: you just get limit lines and a clean response. If your ingestion window during a busy period produces 12,000 log events and you fetch a single 5000-line page, your fingerprint pipeline is working with about 42% of the data without knowing it. Paginate explicitly using the start and end epoch nanosecond parameters, and keep fetching until you get a response shorter than your limit value:

# First page — 5000 line limit, 1-hour window
curl -G "http://loki:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="app-logs"}' \
  --data-urlencode "start=1700000000000000000" \
  --data-urlencode "end=1700003600000000000" \
  --data-urlencode "limit=5000" \
  --data-urlencode "direction=forward"

# If you got exactly 5000 results, advance `start` to the last result's timestamp + 1ns and repeat

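The "advance start to last timestamp + 1 nanosecond" detail matters: the start bound is inclusive, so if you reuse the exact last timestamp you'll re-ingest the final event on every page boundary, creating duplicate fingerprint candidates that inflate your similarity counts. The flip side is that any other lines sharing that exact nanosecond at a page boundary are skipped, which is rare but worth knowing.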
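The pagination loop is easy to get subtly wrong, so here is a sketch of it with the HTTP call injected as a function. Everything named here is illustrative: `paginate` and `fake_fetch` are hypothetical helpers, and `fetch_page` stands in for whatever code issues the query_range request and returns (timestamp, line) pairs sorted oldest first.

```python
from typing import Callable, Iterator

# (timestamp_ns, line) pairs, oldest first, as returned with direction=forward
Entry = tuple[int, str]


def paginate(fetch_page: Callable[[int, int, int], list[Entry]],
             start_ns: int, end_ns: int, limit: int = 5000) -> Iterator[Entry]:
    """Yield every entry in [start_ns, end_ns) by walking pages of at most `limit`."""
    cursor = start_ns
    while cursor < end_ns:
        page = fetch_page(cursor, end_ns, limit)
        yield from page
        if len(page) < limit:
            break                                   # short page: window exhausted
        cursor = max(ts for ts, _ in page) + 1      # last timestamp + 1 ns


# Stand-in for the HTTP call so the loop can be exercised without a Loki instance
EVENTS = [(1_700_000_000_000_000_000 + i, f"line {i}") for i in range(12)]


def fake_fetch(start: int, end: int, limit: int) -> list[Entry]:
    return [e for e in EVENTS if start <= e[0] < end][:limit]


lines = list(paginate(fake_fetch, EVENTS[0][0], EVENTS[-1][0] + 1, limit=5))
print(len(lines))
```

Taking the maximum timestamp across the whole page, rather than the last entry of the last stream, matters when a response contains several streams.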

When This Architecture Is and Isn't the Right Call

The architecture earns its keep in a specific, narrow band of use cases. Outside that band, it becomes a liability — either an underpowered toy or an architectural bottleneck. Getting honest about the boundaries before you build saves you from ripping it out six months later.

The sweet spot is a home lab or small production deployment running somewhere between 5 and 20 Docker services. At that scale, you're generating enough log volume to make manual review painful, but not so much that the embedding step becomes a chokepoint. You get semantic anomaly detection — the kind that catches a novel failure pattern that substring matching would miss entirely — without routing your logs through a cloud ingestion service that charges per gigabyte and retains your data on someone else's infrastructure. If you're running Postgres, Nginx, a few API containers, and a handful of background workers, this pipeline handles that comfortably on modest hardware.

The air-gap case is the other clean fit. Regulated environments, on-premise deployments, anything where logs are considered sensitive operational data — the entire stack runs locally. Ollama serves the model, bge-m3 handles embeddings, Qdrant or a local Postgres vector column stores the fingerprints, and nothing phones home. Compare that to shipping logs to Datadog or Elastic Cloud: you've now made your internal service topology, error messages, and timing patterns visible to a third-party SaaS. For compliance-conscious operators, the self-hosted pipeline isn't a cost optimization — it's a hard requirement.

The failure mode to recognize early: high-cardinality, high-volume services. If any single service is emitting millions of log lines per minute, the embedding step will saturate before the rest of the pipeline can keep up. bge-m3 on a GPU can process batches quickly, but you're still talking about a sequential bottleneck unless you place a durable queue — Redis Streams, NATS JetStream, or similar — in front of the embedding workers and run multiple consumers. That's a non-trivial architectural addition. Without it, backpressure accumulates silently and you start missing anomalies during the exact spikes you most want to catch. If you're already operating at that volume, purpose-built log infrastructure with horizontal scaling baked in is the more honest choice.

One combination that works surprisingly well in practice: if your team is already using local models for development assistance — the workflow covered in AI Coding Tools in 2026: Cloud Copilots vs Local Models — the same hardware running code completions during the day can run log fingerprinting on off-peak cycles. The 32GB VRAM box that serves a coding assistant doesn't sit idle overnight. You're amortizing the hardware cost across both workloads, and the operational mental model transfers directly: the same Ollama instance, the same embedding model, just pointed at log chunks instead of code context. That overlap makes the ops overhead of maintaining the pipeline much lower than it looks on paper.

Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.

Nginx Proxy Manager on Your Home Lab: Performance Tuning Beyond the Defaults

우병수 — Wed, 08 Jul 2026 08:11:09 +0000

TL;DR: Nginx Proxy Manager ships with one goal: get a reverse proxy with Let's Encrypt certs running in under ten minutes. It delivers on that.

📖 Reading time: ~20 min

What's in this article

Why the Default NPM Config Leaves Performance on the Table
Getting NPM Running the Right Way in Docker
Worker and Buffer Tuning Inside the Generated nginx.conf
Proxy Host Advanced Settings That Actually Matter
SSL Performance: Let's Encrypt, Wildcard Certs, and HSTS Pitfalls
Monitoring NPM and Catching Problems Before Users Do
When NPM Is the Wrong Tool

Why the Default NPM Config Leaves Performance on the Table

Nginx Proxy Manager ships with one goal: get a reverse proxy with Let's Encrypt certs running in under ten minutes. It delivers on that. What it doesn't deliver is a config that holds up under anything resembling real traffic. The defaults reflect shared-hosting assumptions — conservative buffer sizes, short keepalive windows, worker counts that don't account for your actual CPU topology. On a machine you own, those aren't safe defaults, they're just wrong defaults.

The symptoms are recognizable once you know what to look for. Upstream timeouts on long-running API responses, say a local Ollama inference call that sends nothing for more than a minute before the response arrives, because proxy_read_timeout ships at 60s and NPM's UI doesn't expose it per-host without a custom config block. 502s during n8n webhook bursts because the upstream queue fills faster than the default buffer can drain. And SSL handshake latency that eats 80–150ms on every short request because session resumption isn't configured and the TLS ticket key rotation is left at defaults. That last one is invisible in synthetic benchmarks but shows up immediately when you're proxying lots of small API calls.

The underlying issue is that NPM wraps nginx in a management layer — which is genuinely useful — but it also abstracts away the knobs that matter. The generated nginx.conf in a stock Docker deployment looks roughly like this:

worker_processes auto;
# "auto" resolves to 1 on a 1-vCPU container — fine for a VPS,
# wrong for a 16-core workstation running in Docker with --cpus not set
events {
  worker_connections 1024;
  # 1024 per worker process, shared between client and upstream connections;
  # saturates fast under webhook fan-out
}

http {
  # no proxy_cache_path defined — caching is entirely disabled
  # keepalive_timeout 75s — upstream keepalives not configured at all
  # client_max_body_size 1m: larger uploads are rejected with 413 Request Entity Too Large
  # gzip off — yes, off by default
}

What this article works through: the specific /data/nginx/custom override files that NPM actually reads, per-proxy-host advanced config blocks that survive container restarts, how to set proxy_read_timeout and proxy_send_timeout high enough for LLM API responses without opening yourself up to connection exhaustion, and how to wire up upstream keepalives so n8n webhook throughput stops degrading under burst load. Everything here runs on a Docker Compose stack — NPM 2.x on top of nginx 1.25 — so the file paths and config injection points are concrete and reproducible.

Getting NPM Running the Right Way in Docker

The volume mount decision trips up more NPM deployments than any config mistake. When you bind-mount /path/on/host/data onto a host directory and NPM starts hammering Let's Encrypt renewals plus proxy host writes simultaneously, you inherit whatever that directory brings with it: mismatched ownership, other processes touching the same path, and, on VPS images provisioned with small inode counts, inode exhaustion. Named volumes still live on the host filesystem under Docker's own data directory, so they are not a different I/O path, but Docker creates and owns them, which removes the permission and path-collision problems. The fix is one line in your compose file, and it costs nothing.

Here's a compose file that avoids the common traps:

services:
  npm:
    image: jc21/nginx-proxy-manager:2.11.3   # pinned — not latest
    container_name: npm
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
      - "81:81"
    environment:
      DB_SQLITE_FILE: "/data/database.sqlite"
      DISABLE_IPV6: "true"           # drop this only if your LAN actually routes v6
      X_FRAME_OPTIONS: "sameorigin"  # default is DENY, which breaks iframe embeds
    volumes:
      - npm_data:/data
      - npm_letsencrypt:/etc/letsencrypt
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:81/api"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 20s

volumes:
  npm_data:
  npm_letsencrypt:

Pin the image. Between 2.10.x and 2.11.x, upstream NPM changed how it stores custom Nginx config snippets in SQLite, and containers that got auto-pulled to a new minor broke existing advanced proxy host configs silently — the proxy kept running from the old rendered config on disk, but any edit triggered a regeneration that dropped custom directives. You won't catch that until you touch a host entry. 2.11.3 is the last version I've validated on my stack; check the release notes before bumping.

DISABLE_IPV6: "true" is worth explaining because the NPM docs barely surface it. If your Docker host has IPv6 disabled at the kernel level (net.ipv6.conf.all.disable_ipv6 = 1 in sysctl), Nginx will still try to bind [::]:80 and [::]:443 by default — and it will fail silently on startup in a way that leaves your plain HTTP proxy hosts unreachable while HTTPS ones work. The symptom looks like a routing problem, not a bind failure. Setting this env var removes the v6 listen directives from the generated config before Nginx ever starts. X_FRAME_OPTIONS is a different category of gotcha: NPM sets DENY by default as a security header, which is correct for most traffic, but if you're embedding Grafana, Homepage, or any self-hosted dashboard in an iframe through the proxy, every embed will be blocked. Override to sameorigin or your specific upstream value rather than disabling it wholesale.

Don't rely on a bare TCP port check against port 81 to decide if NPM is ready for depends_on chains (Docker has no built-in TCP probe; a healthcheck only runs the command you give it). The admin UI port binds before the proxy engine finishes loading its host configs and writing Nginx conf files. The correct check is against the API endpoint:

# Run this manually to confirm the proxy engine is actually up
docker exec npm curl -sf http://localhost:81/api

# Expected output when healthy — any non-empty 200 response:
# {"status":"OK"}

# If this returns nothing or a connection refused,
# Nginx hasn't finished initializing even if port 81 is open

That -sf flag combination is important: -s silences the progress meter (and curl's own error text, so add -S if you want failures printed), and -f makes curl return a non-zero exit code on HTTP 4xx/5xx, which is what Docker's healthcheck needs in order to register a failed probe. The start_period: 20s in the compose healthcheck above gives NPM time to do its initial SQLite migration on first boot; failed probes during that window don't count toward the retry limit, so a slow host isn't marked unhealthy before it has finished starting.
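
For scripts that need to gate on NPM outside of Compose, the same readiness check can be sketched in Python. This is a minimal sketch, not anything NPM ships: http_ok and wait_until_ready are hypothetical helper names, and the only thing taken from the setup above is the http://localhost:81/api URL.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status (roughly what curl -sf checks)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # covers connection refused, timeouts, and HTTP 4xx/5xx (HTTPError)
        return False


def wait_until_ready(probe, attempts=5, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times and return True on the first success."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(delay)  # no point sleeping after the final attempt
    return False


# Example call (not executed here):
# wait_until_ready(lambda: http_ok("http://localhost:81/api"), attempts=5, delay=30.0)
```

Passing the probe in as a callable keeps the retry loop independent of the network, so the same loop can gate on any other readiness signal.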

Worker and Buffer Tuning Inside the Generated nginx.conf

NPM regenerates /etc/nginx/nginx.conf on every container restart and every time you save a proxy host — the file is not yours to own. Edit it directly and your changes vanish the next time the UI touches anything. The correct injection points are the Custom Nginx Configuration textarea inside each proxy host's Advanced tab, or files dropped into /data/nginx/custom/ on the host. Files in that directory get included automatically; NPM ships with hooks for http_top.conf and events.conf that map to exactly the blocks you need to tune.

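Worker and connection limits can't go in http_top.conf: that file is included inside the http {} block, where worker_rlimit_nofile and events-level directives are invalid and nginx will refuse to start. The include points NPM documents for them are /data/nginx/custom/root.conf (main context, included at the end of nginx.conf) and /data/nginx/custom/events.conf (included inside the events {} block). The stock config already sets worker_processes auto, so don't redeclare it; nginx rejects a duplicate directive. The worker_connections ceiling is the one that will cap you well before your NIC does. Split the overrides like this, then run nginx -t inside the container before reloading:

# /data/nginx/custom/root.conf
# main context: included at the end of nginx.conf, outside http {}
worker_rlimit_nofile 65535;

# /data/nginx/custom/events.conf
# included inside NPM's events {} block, so no wrapper here
worker_connections 4096;
multi_accept on;      # accept all pending connections per epoll wakeup
use epoll;            # explicit on Linux; don't leave it to autodetect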

worker_rlimit_nofile matters more than most operators realize. Each proxied request consumes two file descriptors, one client-side and one upstream-side, so a 4096 worker_connections ceiling with the default 1024 nofile limit runs out of descriptors under load long before it runs out of connection slots, and the only trace is a "Too many open files" line in the error log. Raise the container's limit too: set ulimits.nofile in your compose file, or LimitNOFILE=65535 in the Docker daemon's systemd unit.
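
The descriptor arithmetic is easy to get wrong by a factor of two, so here is a small sketch of it. The function names and the headroom value of 64 are assumptions for illustration, not nginx settings.

```python
def proxied_request_capacity(worker_connections, connections_per_request=2):
    """Rough number of proxied requests one worker can hold at once.

    Assumes every connection slot is used for proxying: each proxied request
    takes one client-side and one upstream-side connection.
    """
    return worker_connections // connections_per_request


def nofile_is_sufficient(worker_connections, nofile_limit, headroom=64):
    """True if the per-process descriptor limit covers every connection slot
    plus some headroom for log files, listening sockets, and temp files."""
    return nofile_limit >= worker_connections + headroom
```

With the values from this section, 4096 connection slots hold about 2048 proxied requests per worker, and a 1024 descriptor limit fails the check while 65535 passes it.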

Buffer tuning is where NPM setups running Ollama or large API backends fall apart. The defaults assume small web responses. When proxying Ollama's /api/chat endpoint with buffering left on, nginx starts spilling response data to a temp file the moment it exceeds the in-memory buffer allocation, and disk spill shows up as inexplicable latency spikes, not errors. Add this to /data/nginx/custom/http_top.conf or to a specific proxy host's Advanced block:

# Prevents disk-spill on large/streaming API responses
proxy_buffer_size      16k;    # holds the response headers + first chunk
proxy_buffers          8 16k;  # 8 × 16k = 128k total in-memory budget per request
proxy_busy_buffers_size 32k;   # max handed to client while rest is still being read
proxy_request_buffering off;   # pass the request body upstream as it arrives; response streaming is governed by proxy_buffering
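
nginx refuses to start when proxy_busy_buffers_size falls outside the range implied by the other two directives, so it helps to check the numbers before pasting them in. The sketch below encodes the two constraints from the nginx proxy module (at least as large as the biggest single buffer, at most all proxy_buffers minus one). parse_size and check_proxy_buffers are hypothetical helper names.

```python
def parse_size(value):
    """Convert an nginx size such as '16k' or '1m' to bytes."""
    units = {"k": 1024, "m": 1024 * 1024}
    v = value.strip().lower()
    if v and v[-1] in units:
        return int(v[:-1]) * units[v[-1]]
    return int(v)


def check_proxy_buffers(buffer_size, buffers_count, buffers_each, busy_size):
    """Return (total_bytes, problems) for a set of proxy buffer directives."""
    first = parse_size(buffer_size)
    each = parse_size(buffers_each)
    busy = parse_size(busy_size)
    total = buffers_count * each  # in-memory budget from proxy_buffers
    problems = []
    if busy < max(first, each):
        problems.append("proxy_busy_buffers_size is smaller than the largest single buffer")
    if busy > total - each:
        problems.append("proxy_busy_buffers_size exceeds all proxy_buffers minus one buffer")
    return total, problems
```

For the values above, check_proxy_buffers("16k", 8, "16k", "32k") reports a 128k budget and no problems.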

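For n8n webhook backends that get hit repeatedly by the same automation loop, the TCP handshake cost accumulates fast. Keepalive on the upstream block removes it, provided the proxied location also speaks HTTP/1.1 to the backend and clears the Connection header; without those two lines nginx opens a fresh upstream connection for every request:

upstream n8n_backend {
    server 127.0.0.1:5678;
    keepalive 32;              # up to 32 idle keepalive connections to the backend, per worker
}

server {
    # inside your proxy host's advanced config
    keepalive_timeout   75s;   # client-side keepalive
    keepalive_requests  1000;  # before nginx forces a connection recycle

    location / {
        proxy_pass http://n8n_backend;
        proxy_http_version 1.1;          # upstream keepalive needs HTTP/1.1
        proxy_set_header Connection "";  # and a cleared Connection header
    }
}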

Gzip belongs in http_top.conf globally rather than per-host, because you want it covering JSON API responses from every backend, not just the ones you remembered to configure. Level 4 is the practical ceiling on modern x86 hardware; beyond it, CPU cost grows faster than the byte savings do, and you gain nothing on already-small payloads that the gzip_min_length gate handles anyway:

gzip on;
gzip_comp_level  4;
gzip_min_length  1024;   # don't bother compressing tiny responses
gzip_proxied     any;    # compress responses to reverse-proxied clients too
gzip_vary        on;     # tells CDNs the response varies by Accept-Encoding
gzip_types
    text/plain
    text/css
    application/json
    application/javascript
    application/x-javascript
    text/xml
    application/xml;

One gotcha: gzip_proxied is off by default and the NPM UI gives no hint it exists. Without gzip_proxied any, responses to requests that arrived through another proxy layer (common when NPM sits behind Cloudflare) won't be compressed even if the client sends Accept-Encoding: gzip. The gzip_vary on line is equally important if a CDN or Varnish layer sits in front; without it, a compressed response can get cached and served to a client that never declared gzip support.

Proxy Host Advanced Settings That Actually Matter

The most underused field in the NPM UI is the Custom Nginx Configuration box on each proxy host's Advanced tab. Whatever you put there is injected into that host's generated server block, and that's where you need to be for anything beyond basic reverse proxying. The one I add to every AI inference endpoint without exception: proxy_read_timeout 300s;. The default is 60 seconds, measured between two successive reads from the upstream rather than over the whole response. A locally-served LLM producing a non-streamed 2,000-token response with a quantized 13B model on modest hardware will routinely stay silent for longer than that, and Nginx closes the upstream connection and returns a 504 to the client. No error in your app logs, just a failed request and a confused user. Five seconds of config work eliminates the entire class of problem.

# Custom Nginx Configuration (per proxy host, Advanced tab)
# Drop this in for any LLM or long-poll backend

proxy_read_timeout 300s;
proxy_send_timeout 300s;
proxy_connect_timeout 10s;   # keep short — see note on health checks below
proxy_buffering off;         # required for SSE / streaming token output

Don't assume NPM's WebSocket toggle (the "Websockets Support" checkbox in the UI) gives you everything a WebSocket backend needs. The piece that matters is proxy_http_version 1.1;: without it, Nginx defaults to HTTP/1.0 for upstream connections, which supports neither the Upgrade handshake nor persistent connections. The symptoms are intermittent WebSocket drops, failed upgrade handshakes on certain backends, or keepalive connections that close immediately. Check the rendered host file under /data/nginx/proxy_host/ for the directive, and if it's missing or you want the behavior to be explicit, write the full block yourself in the Advanced tab:

proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_cache_bypass $http_upgrade;

Rate limiting requires two separate config locations, which is the part NPM's UI makes awkward. The zone declaration, limit_req_zone $binary_remote_addr zone=api:10m rate=30r/m;, must sit at the http {} level, and the UI has no field for that, so it goes in a custom include such as /data/nginx/custom/http_top.conf. Then on each individual proxy host that should enforce the limit, the Advanced tab gets limit_req zone=api burst=10 nodelay;. The nodelay flag matters here: without it, requests inside the burst allowance are queued and released at the configured rate, which can cause a pile-up of held connections if a runaway n8n workflow or a misbehaving API client starts hammering a local Ollama endpoint. With nodelay, requests within the burst are served immediately and anything over it is rejected at once, with a 503 unless you change the status.

# /data/nginx/custom/http_top.conf (http block level)
limit_req_zone $binary_remote_addr zone=api:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=ui:10m rate=120r/m;

# Per-host Advanced tab (enforcement point)
limit_req zone=api burst=10 nodelay;
limit_req_status 429;   # return proper 429 instead of default 503
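
To see what burst=10 nodelay does to a burst of requests, here is a simplified model of the accounting limit_req performs. It is a sketch under stated assumptions: it tracks one client, ignores nginx's millisecond rounding and shared-memory eviction, and simulate_limit_req is a hypothetical name.

```python
def simulate_limit_req(arrivals, rate_per_min, burst):
    """Return one boolean per arrival time (in seconds): True if accepted.

    Models nodelay semantics: an accepted request is served immediately,
    and a request that would push the excess above `burst` is rejected.
    """
    rate = rate_per_min / 60.0  # requests drained per second
    excess = 0.0                # how far ahead of the allowed rate the client is
    last = None                 # time of the last accepted request
    results = []
    for t in arrivals:
        if last is None:
            new_excess = 0.0
        else:
            new_excess = max(0.0, excess - rate * (t - last) + 1.0)
        if new_excess > burst:
            results.append(False)  # rejected; state is left untouched
        else:
            excess, last = new_excess, t
            results.append(True)
    return results
```

Twelve requests arriving at the same instant against rate=30r/m burst=10 give eleven accepted (the first plus ten of burst) and one rejected; two seconds later one more slot has drained and the next request is accepted again.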

The missing upstream health check is a real operational gap. The open-source NPM build has no equivalent of Nginx Plus's health_check directive or HAProxy's backend polling, so when a backend goes down NPM keeps routing traffic to it. If the host still answers and simply has nothing listening on the port, the connection is refused and the client gets a 502 right away. If the address stops answering altogether (a removed container IP, a powered-off machine, a firewall dropping packets), every request hangs until proxy_connect_timeout expires, and the default is 60 seconds. Setting a short value such as proxy_connect_timeout 5s; per host cuts that failure surface down immediately. Pair it with a proxy_read_timeout appropriate to the endpoint, and bad backends surface fast enough that monitoring (even a basic Uptime Kuma instance polling the proxied URL) will catch them before users notice a pattern.
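
Since NPM won't probe backends for you, a small external probe can tell a refused connection apart from one that hangs. This is a minimal sketch using only the standard library; probe_backend is a hypothetical name and the labels it returns are my own.

```python
import socket


def probe_backend(host, port, timeout=5.0):
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # nothing listening: a proxy would fail fast here
    except socket.timeout:
        return "timeout"   # packets dropped: a proxy waits out its connect timeout
    except OSError:
        return "error"     # unreachable network, DNS failure, and similar
```

Run from the same network namespace as NPM, a "timeout" result for a backend is the case where a short proxy_connect_timeout pays off.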

SSL Performance: Let's Encrypt, Wildcard Certs, and HSTS Pitfalls

The wildcard cert via DNS challenge is one of those setups that feels like extra work until the third time you add a new subdomain and realize you didn't touch Certbot at all. With NPM's built-in Let's Encrypt integration, you configure the DNS provider once — Cloudflare or Route53 both have first-class support — drop in an API token, and every subdomain you spin up afterward is covered automatically. More practically for home labs: the DNS-01 challenge never requires port 80 to be reachable. If your ISP quietly blocks inbound port 80 (common on residential connections), the HTTP-01 challenge fails silently and you end up debugging NPM logs instead of realizing the port is the problem. DNS-01 sidesteps that entirely. Renewal happens on a schedule against the DNS API, not against your server's open ports.

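OCSP stapling is disabled in NPM's default nginx template, and for certificates that carry an OCSP responder URL that's a consistent 100–200ms tax on every cold TLS handshake. (Let's Encrypt stopped putting OCSP URLs in its certificates in 2025 and has shut down its responders, so with a Let's Encrypt cert there is nothing to staple and nginx ignores the directive; this applies to certificates from CAs that still run OCSP.) The fix goes in NPM's "Advanced" custom config block for each proxy host: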

ssl_stapling on;
ssl_stapling_verify on;
# resolver needs to be reachable from the nginx process — not just from the host OS
resolver 1.1.1.1 valid=60s;
resolver_timeout 5s;

The resolver directive is the one people skip and then wonder why stapling silently fails. Nginx needs to independently resolve the OCSP responder URL from within the container, and if you don't give it a working resolver, ssl_stapling logs a warning and falls back to no stapling. You can verify stapling is active with openssl s_client -connect your.domain:443 -status — look for OCSP Response Status: successful in the output. If you see no response sent, the resolver is the first thing to check.

The HSTS max-age default in NPM is 63072000 seconds, exactly two years (730 days). That number is fine for a stable production host, but if you toggle HSTS on during initial setup and your cert or config has any issue requiring a fallback to plain HTTP, every browser that visited during that window will refuse HTTP connections for two years. The only server-side fix is to send max-age=0 over a working HTTPS connection, which is precisely what you don't have at that point; otherwise recovery is a browser-by-browser manual HSTS cache clear. During testing, leave the HSTS checkbox off and set a short-lived header in the custom config block, then confirm with curl -sI that exactly one Strict-Transport-Security header comes back:

# In NPM's Advanced tab, with the HSTS checkbox off so the header isn't sent twice
add_header Strict-Transport-Security "max-age=86400" always;

Promote to max-age=63072000; includeSubDomains; preload only after the host has been stable for at least a week with no certificate errors. The preload flag is a separate commitment — submitting to the HSTS preload list is effectively permanent and can't be reversed quickly, so don't add it to internal or experimental hosts.
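
The max-age values are just days converted to seconds, which is worth making explicit when you move between the testing and production values. A small sketch, with hsts_header as a hypothetical helper name:

```python
SECONDS_PER_DAY = 86400


def hsts_header(days, include_subdomains=False, preload=False):
    """Build a Strict-Transport-Security value from a lifetime in days."""
    parts = [f"max-age={days * SECONDS_PER_DAY}"]
    if include_subdomains:
        parts.append("includeSubDomains")
    if preload:
        parts.append("preload")
    return "; ".join(parts)
```

hsts_header(1) gives the one-day testing value, and hsts_header(730, True, True) gives the two-year production value with both flags.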

HTTP/2 ships enabled in NPM's SSL config, but "enabled" and "working end-to-end" aren't the same thing. HTTP/2 here applies only to the client-facing side: with nginx 1.25 the proxy module talks HTTP/1.0 or 1.1 to the upstream regardless, so older Java apps, Ruby Rack apps, or a second nginx behind NPM never see HTTP/2 at all. That's usually fine, but occasionally the translation causes header handling issues. Check the client-facing protocol directly:

curl -I --http2 https://your.domain 2>&1 | head -5
# Healthy output starts with:
# HTTP/2 200

If you see HTTP/1.1 200 instead, the endpoint that answered didn't offer h2 during ALPN negotiation. Check first whether the HTTP/2 Support toggle is on in the proxy host's SSL tab, then whether something in front of NPM (a CDN, a tunnel, another proxy) is terminating TLS and answering on NPM's behalf. An expired or mismatched certificate is a different failure: curl aborts with a certificate error rather than falling back to HTTP/1.1. The upstream side can't cause the downgrade, because the client's protocol is settled before nginx ever opens the proxy_pass connection.

Monitoring NPM and Catching Problems Before Users Do

The most actionable signal NPM can give you is one most operators never log: the upstream response time. NPM writes logs to /data/logs/, with an access log and an error log per proxy host (current 2.x images name them proxy-host-N_access.log and proxy-host-N_error.log; run ls /data/logs to confirm on yours). The stock log format doesn't necessarily include $upstream_response_time, so add it with a custom log_format if it's missing. Since /data is already a volume, the logs are readable from the host, and you have everything you need to catch degrading backends before a user files a ticket.

If you're already running a Loki/Grafana stack, the setup is straightforward: point Promtail at the access logs under /data/logs/ with a pipeline stage that parses the upstream response time field, then build a Grafana alert on p95 latency per host. If you're not running Loki yet, the lowest-effort option that still beats nothing is a simple shell watcher on the host:

# tail all NPM access logs and flag anything over 2s upstream response
# assumes a custom log_format that writes "upstream_response_time: <seconds>";
# match() with an array argument is a gawk extension, so call gawk explicitly
tail -F /data/logs/proxy-host-*_access.log | gawk '
  {
    if (match($0, /upstream_response_time: ([0-9.]+)/, arr)) {
      if (arr[1]+0 > 2.0) print "SLOW: " $0
    }
  }
'

Rough, but it fires immediately and requires no additional infrastructure. When you do move to Loki, the upstream_response_time field becomes a proper metric-from-log that tells you exactly which backend is dragging.
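
If gawk isn't available on the host, the same filter is a few lines of Python. This sketch makes the same assumption as the shell version, namely a log format that writes "upstream_response_time: <seconds>"; slow_lines is a hypothetical name.

```python
import re

# Assumes a custom log_format that writes "upstream_response_time: <seconds>".
UPSTREAM_TIME = re.compile(r"upstream_response_time: ([0-9.]+)")


def slow_lines(lines, threshold=2.0):
    """Yield log lines whose upstream response time exceeds the threshold."""
    for line in lines:
        m = UPSTREAM_TIME.search(line)
        if not m:
            continue  # no timing field, or "-" because no upstream was contacted
        try:
            seconds = float(m.group(1))
        except ValueError:
            continue  # malformed number; skip rather than crash the watcher
        if seconds > threshold:
            yield line.rstrip("\n")
```

To use it as a live watcher, iterate over sys.stdin and pipe tail -F into the script; because it is a generator, it prints matches as lines arrive.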

For uptime, I run this on a PM2 cron every 60 seconds across every service that matters. No external dependency, no ping fee, just a dead-simple shell one-liner that logs HTTP status and total time:

// In ecosystem.config.js, one entry per critical service (inside the apps array)
{
  name: "healthcheck-myapp",
  script: "bash",
  // array form so the whole command reaches bash -c as a single argument
  args: ["-c", "echo $(date -u +%FT%T) $(curl -o /dev/null -sw '%{http_code} %{time_total}s' https://myapp.internal/health) >> /var/log/healthchecks/myapp.log"],
  cron_restart: "* * * * *",
  autorestart: false,
  watch: false
}

Healthy services respond in under 100ms on a local network. Anything logging above 800ms consistently is either a backend problem or an NPM configuration issue — typically a missing keepalive or an oversized buffer setting forcing full response buffering before the first byte goes out.
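
"Consistently" is the operative word, so it helps to check a window of recent samples rather than react to a single slow one. The sketch below reads the log lines the healthcheck entry above writes (timestamp, HTTP code, total time with an s suffix). The window of five samples and the function names are assumptions.

```python
def parse_check(line):
    """Parse '<timestamp> <http_code> <seconds>s' into (timestamp, code, seconds)."""
    stamp, code, total = line.split()
    return stamp, int(code), float(total.rstrip("s"))


def consistently_slow(lines, threshold=0.8, window=5):
    """True if the most recent `window` checks all took longer than `threshold` seconds."""
    recent = [parse_check(l) for l in lines if l.strip()][-window:]
    return len(recent) == window and all(sec > threshold for _, _, sec in recent)
```

Feed it the lines of /var/log/healthchecks/myapp.log; fewer than five samples never triggers it, so a freshly rotated log stays quiet.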

The 502 vs 504 distinction is where most operators waste time. A 502 Bad Gateway from NPM usually means the upstream TCP connection was refused because the process isn't listening (less often, the upstream accepted the connection and then closed it or sent a malformed response). For the refused case the correct response is a restart, full stop. A 504 Gateway Timeout means the upstream accepted the connection but didn't respond within the timeout window: the process is alive but stalled, usually under load or stuck on a downstream dependency like a database lock. Restarting after a 504 without understanding why it stalled just delays the same failure by a few minutes. Watch for the transition pattern in your error logs: if a host starts flipping 502 → 504 → 502 in a short window, the process is crash-looping under load, which is a different problem entirely from either one in isolation. Set NPM's proxy timeout (proxy_read_timeout in the advanced config block) to something deliberate; the default 60 seconds means a stalled upstream ties up a client connection and an upstream connection for a full minute before the 504 fires.

One area that gets skipped in almost every NPM hardening guide: the proxy layer's interaction with AI tooling and local model infrastructure. Streaming inference responses, long-polling completions, and auth header passthrough all have sharp edges at the reverse proxy level — especially when your local model setup expects specific timeout and buffer behaviors. The AI Coding Tools in 2026: Cloud Copilots vs Local Models guide covers how local-model setups interact with reverse proxy constraints like auth headers and streaming timeouts, which is worth reading before you wire an Ollama endpoint or OpenAI-compatible server through NPM for the first time.

When NPM Is the Wrong Tool

The 30-proxy-host threshold is roughly where NPM's SQLite backend starts working against you rather than for you. Below that, the GUI is a genuine time-saver. Above it, you're fighting drift: a host edited in the UI doesn't show up in Git, rollbacks mean restoring a SQLite file, and your "config" is effectively a database dump. If you're already running infrastructure-as-code for everything else — Terraform, Ansible, Docker Compose in a repo — NPM becomes the odd one out that you can't peer-review or diff. Caddy with a committed Caddyfile or Traefik driven by Docker labels both solve this cleanly. A git diff on a Caddyfile shows exactly what changed and when; NPM's export JSON does not.

# Caddy equivalent of a typical NPM reverse proxy entry — version-controllable, reviewable
app.example.com {
    reverse_proxy localhost:3000 {
        header_up Host {host}
        header_up X-Real-IP {remote_host}
    }
    tls your@email.com   # ACME handled inline, no separate certbot process
}

The Streams tab in NPM exposes TCP/UDP proxying, and it works for basic cases: forwarding a Minecraft port or a WireGuard UDP endpoint is straightforward. What you don't get is any tuning surface: no proxy_timeout, no proxy_buffer_size, no proxy_connect_timeout, no way to set so_keepalive on the listening socket or proxy_socket_keepalive toward the upstream. For a game server where a 200ms TCP timeout difference is felt by players, or a Postgres tunnel where you want fine-grained keepalive behavior, you need a raw nginx container with a hand-written stream {} block. That's not a workaround; it's the right architecture for the use case.

# stream block nginx config for a latency-sensitive TCP tunnel
# mount this as /etc/nginx/nginx.conf in a plain nginx:alpine container
events {}                           # required even when empty, or nginx refuses to start

stream {
    upstream db_backend {
        server 10.0.0.5:5432;
    }

    server {
        listen 5432;
        proxy_pass db_backend;
        proxy_timeout 10s;          # closes sessions idle for 10s; raise it if clients hold pooled connections
        proxy_connect_timeout 2s;
        proxy_buffer_size 16k;      # default 16k is fine for PG, tune down for game protocols
    }
}

The process count is the other honest liability. NPM runs nginx, a Node.js GUI server, SQLite, and a Certbot/openssl-backed certificate renewal loop. On a 4GB homelab VM that's background noise. On a device with 512MB or 768MB RAM — an older Raspberry Pi, an edge node, a cheap VPS — that overhead is measurable in available headroom for your actual workloads. Caddy is a single statically-linked binary that handles TLS automatically; bare nginx is even leaner. For edge nodes or constrained devices, install Caddy via its official package and skip the container stack entirely:

# Caddy on a Debian/Ubuntu edge node — no Docker, no secondary processes
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt install caddy
# Binary is ~45MB, zero runtime dependencies, TLS built in

The honest summary: NPM earns its keep for solo operators managing a moderate number of HTTPS reverse proxies who want a visual cert dashboard and don't need repeatability guarantees. Push past that boundary in any direction — scale, IaC discipline, non-HTTP protocols, constrained hardware — and the abstraction layer NPM adds becomes friction rather than utility. Knowing which side of that line your setup sits on saves you from bolting on workarounds that never quite fit.


Building a Local Dashboard with Homepage: One Config File to Rule Your Self-Hosted Stack

우병수 — Mon, 06 Jul 2026 08:10:40 +0000

TL;DR: The moment you cross about a dozen self-hosted services, browser bookmarks stop working as an ops strategy.

📖 Reading time: ~22 min

What's in this article

The Problem: Thirty Services, Zero Visibility
What Homepage Actually Is (and What It Isn't)
Getting It Running: Docker Compose and Directory Layout
Configuring Services, Widgets, and Integrations
Putting It Behind a Reverse Proxy with Auth
Non-Obvious Behaviors That Will Cost You Time
When Homepage Earns Its Place (and When It Doesn't)

The Problem: Thirty Services, Zero Visibility

The moment you cross about a dozen self-hosted services, browser bookmarks stop working as an ops strategy. You end up with a mental map that looks something like this: Ollama is on :11434, n8n is on :5678, Grafana on :3000, Portainer on :9000, your WordPress instance somewhere behind an nginx proxy, and three other things you added last month and have already half-forgotten. None of those tabs tell you whether the service is actually responding. A bookmark to http://localhost:8080 is equally useful whether the container is healthy or crashed three hours ago.

That gap between "has a port" and "is actually running" is where silent failures live. My n8n flows pull from a local embedding API, send results to a Postgres instance, and occasionally hit Ollama for inference. If any one of those services dies quietly — no alert, no log you happened to be watching — the pipeline just stops producing output. Without a dashboard showing live health checks, the failure mode is: you notice something downstream is wrong, you start manually curling endpoints, you eventually find the dead container. That whole process takes ten to thirty minutes. A dashboard with a working health probe turns it into a ten-second glance.

The alternatives are real options worth knowing. Heimdall is polished and has a decent app library, but it stores all its state in a SQLite database you have to back up and restore manually on a rebuild. Flame is minimal and fast but the integration ecosystem is thin — you get links and icons, not live service metrics. Dasherr is clean for simple setups. Homepage wins the comparison specifically because your entire configuration lives in YAML files that you can version-control and copy verbatim onto a new machine. There's no GUI state to recreate. Rebuilding the host means docker compose up -d plus your config directory, and you're back to exactly where you were. That property matters more the more services you're running, because the dashboard itself becomes infrastructure, not a convenience.

Homepage also ships with first-party integrations for the services that actually matter in a self-hosted stack — Portainer, Sonarr, Radarr, Proxmox, various *arr apps, and generic API widgets for anything custom. Those integrations pull live data, not just link out. The difference between a link tile and a widget showing current queue depth or CPU utilization is the difference between a bookmark manager and an actual operator panel. For context on how a dashboard fits into a broader self-hosted automation workflow, see our guide on Workflow Automation in 2026: n8n, Zapier, and Self-Hosted Pipelines.

What Homepage Actually Is (and What It Isn't)

Homepage reads YAML files off disk and renders a dashboard — no database, no migrations, no state to back up beyond the config directory. That sounds like a minor implementation detail until you realize it means your entire dashboard is just a folder you can commit to git, rsync to a new host, or roll back with git checkout. The "static-feeling" part of the description is apt: the Next.js app does server-side rendering but feels like a static page to the user because there's no session, no login wall (by default), and no write path through the UI. You configure it entirely by editing files.

The integration model is polling, not proxying. When you add a Sonarr widget, Homepage makes an authenticated GET request from the server side to your Sonarr API using the key you supply in the config. It renders the result into the widget. That's it. No traffic flows through Homepage to reach Sonarr — a browser tab to Sonarr still goes directly to Sonarr. This matters for two reasons: widget data can be stale by a few seconds (the poll interval is configurable but defaults are conservative), and if your Sonarr instance is on an internal network segment unreachable from the Homepage container, the widget will silently fail rather than route around it. The network path is container → target service, not browser → Homepage → target service.

The most common setup mistake I see documented in forums: people deploy Homepage and then expect it to handle reverse proxying or authentication. It does neither. Homepage has no concept of protecting a downstream service, issuing tokens, or acting as an ingress point. If you want sonarr.yourdomain.com to require a login, that's a job for Nginx Proxy Manager, Traefik, or Authelia; Homepage just shows you a link and a widget. Conflating these tools leads to configs where Homepage is exposed to the internet with no auth layer in front of it. That is a real exposure: the widget API calls are made server-side, so the keys themselves stay off the page, but anyone who can load the dashboard sees every service name, internal URL, and live widget value you've configured. Keep Homepage on an internal interface or behind an auth-aware reverse proxy.

Resource footprint is genuinely light. The container sits at roughly 80–120 MB RSS at idle on my workstation, CPU is near zero between renders, and the image pulls at around 200 MB compressed. You can co-locate it comfortably on a host running heavier workloads — it won't compete with Ollama for RAM or a Postgres instance for I/O. The one thing that can spike memory is loading a dashboard with many widgets that all poll simultaneously on startup; that settles within a few seconds. For a Raspberry Pi 4 or a low-spec VPS, this is one of the few dashboard options that won't feel punishing.

Getting It Running: Docker Compose and Directory Layout

The first surprise for most people deploying Homepage is that there's no database, no migration step, and no admin UI to configure. Everything is flat YAML files in a single config directory. That's a feature, not a limitation — it means your entire dashboard is version-controllable and reproducible from scratch in under a minute. But it also means if you skip the directory layout step, Homepage silently renders a blank page with zero indication of what went wrong.

Pin the image. ghcr.io/gethomepage/homepage:latest will work, but a specific tag like v0.9.2 means you won't wake up to a broken dashboard after an unattended pull. Here's a compose file that covers the essentials:

services:
  homepage:
    image: ghcr.io/gethomepage/homepage:v0.9.2
    container_name: homepage
    ports:
      - "3000:3000"
    volumes:
      # config directory must exist on host before first boot
      - /opt/homepage/config:/app/config
      # optional: drop custom PNG/SVG icons here, reference as /icons/myapp.png
      - /opt/homepage/icons:/app/public/icons
      # socket mount — read the trade-off discussion below before enabling this
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      # "stdout" keeps logs in docker logs output; "file" writes to /app/config/logs/
      LOG_TARGETS: stdout
      # set your local timezone or widget times will be UTC
      TZ: America/New_York
    restart: unless-stopped

Before docker compose up does anything useful, the five config files need to exist. Homepage won't create them. Each one owns a distinct slice of the UI:

services.yaml — the main grid of service cards, grouped into named sections. This is where your Jellyfin, Grafana, and Gitea entries live.
bookmarks.yaml — a separate column of plain links with no status checks. Good for external URLs you want nearby without the overhead of a widget.
widgets.yaml — top-bar info widgets: system stats, weather, search bar, date/time. Configured independently from service cards.
settings.yaml — global layout options: color theme, column count, background image, favicon, custom CSS injection point.
docker.yaml — connection definitions for Docker hosts (local socket or remote TCP). Homepage references these by name inside services.yaml when doing container auto-discovery.

Create all five as empty files before first boot: touch /opt/homepage/config/{services,bookmarks,widgets,settings,docker}.yaml. Homepage parses them on startup and on every file change via a filesystem watcher — no restart required when you edit. If any file has a YAML syntax error, the watcher logs the parse failure to stdout but continues serving the last valid state. That's why checking logs immediately after first boot matters: docker logs homepage --follow for the first 30 seconds will surface any parse errors before you start adding real entries and wondering why nothing appears.
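
If you would rather script the scaffolding than remember the touch one-liner, here is a minimal Python sketch of the same step. The function name scaffold_config is my own, and it assumes only the five file names listed above; it never overwrites a file that already exists.

```python
from pathlib import Path

# The five config files named in this section.
CONFIG_FILES = ("services", "bookmarks", "widgets", "settings", "docker")


def scaffold_config(config_dir: str) -> list[str]:
    """Create any missing config files as empty files; return the names created."""
    root = Path(config_dir)
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name in CONFIG_FILES:
        path = root / f"{name}.yaml"
        if not path.exists():
            path.touch()  # only reached for files that do not exist yet
            created.append(path.name)
    return created
```

Calling scaffold_config("/opt/homepage/config") before the first docker compose up gives the same result as the touch command, and a second call returns an empty list because nothing is missing any more.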

The Docker socket mount is the sharpest edge in this setup. Mounting /var/run/docker.sock read-only gives Homepage the ability to query container metadata, which powers the label-based auto-discovery (homepage.name, homepage.href, etc.). The :ro flag helps less than it looks: it only stops the container from replacing or deleting the socket file, and the Docker API has no concept of read-only enforcement, so a process with socket access can issue any Docker API call regardless of the mount flag. That's root-equivalent on the host. For a homelab running on a single machine where you trust everything in the stack, the risk is contained. For anything exposed to the network or running untrusted containers, the mitigation worth deploying is tecnativa/docker-socket-proxy. It runs a slim HAProxy instance that sits between Homepage and the real socket, whitelisting only the API endpoints Homepage actually needs (GET /containers, essentially). You'd replace the socket mount with a TCP connection to the proxy container in docker.yaml. The proxy itself gets the socket mount and should sit on an internal-only Docker network shared with Homepage, so nothing else can reach it and it has no route out. (network_mode: none would not work here, because Homepage has to reach the proxy over TCP.) It's one extra service in compose and meaningfully reduces the blast radius if Homepage ever has a supply-chain issue.

First-boot verification is a three-step check: hit http://localhost:3000 and confirm the default Homepage layout renders (it should show empty sections if your YAML files are valid but empty), run docker logs homepage 2>&1 | grep -i error to confirm zero parse failures, and if you mounted the socket, run docker logs homepage 2>&1 | grep -i docker to confirm it connected to the Docker API successfully. If the page loads but is completely blank rather than showing an empty layout, the most common cause is a missing config directory or a permissions mismatch — Homepage runs as UID 1000 by default, so chown -R 1000:1000 /opt/homepage/config fixes it without needing to add user: overrides to the compose file.

Configuring Services, Widgets, and Integrations

The gap between Homepage looking like a browser start page and actually functioning as a live operations panel is almost entirely in services.yaml. Most guides stop at showing icons in a grid. This one doesn't.

services.yaml Anatomy

The file is a YAML list of groups. Each group is a named key containing a list of service objects. That hierarchy — group → services — is the whole structure. Here's a concrete example with n8n and Portainer that actually works:

- Automation:
    - n8n:
        href: http://192.168.1.10:5678
        description: "Workflow automation engine"
        icon: n8n.png
        server: my-docker
        container: n8n

    - Portainer:
        href: http://192.168.1.10:9000
        description: "Docker management UI"
        icon: portainer.png
        widget:
          type: portainer
          url: http://portainer:9000
          env: 1
          # env refers to the Portainer environment ID, not an env var
          key: "{{HOMEPAGE_VAR_PORTAINER_KEY}}"

The icon field accepts either a slug from the dashboard-icons repo (just the filename like n8n.png) or a path like /icons/custom-thing.png if you've mounted a local directory. Homepage resolves the slug against the dashboard-icons CDN automatically. The server and container fields on the n8n entry enable the Docker integration: if you've defined a Docker host named my-docker in docker.yaml, the tile will show the container's running/stopped state without any widget block.

Uptime Kuma Widget Deep-Dive

The Uptime Kuma widget is one of the cleaner first-party integrations. It hits the Kuma API and surfaces up/down monitor counts plus average response time across all monitors in a given status page slug. The config looks like this:

- Monitoring:
    - Uptime Kuma:
        href: http://192.168.1.10:3001
        description: "Service uptime monitoring"
        icon: uptime-kuma.png
        widget:
          type: uptimekuma
          url: http://uptime-kuma:3001
          slug: homelab
          # "slug" must match the status page slug you created in Kuma's UI
          # under Status Pages → your page → the URL segment after /status/

There is no API key field on this widget — Kuma's status page endpoint is intentionally public if you've created a status page. The slug field is the part people get wrong most often: it's not the monitor name, it's the URL slug of a status page you've set up inside Kuma. Go to Status Pages in the Kuma UI, create one, add your monitors to it, and use whatever you put in the "path" field. The widget will then display total monitors, how many are up, how many are down, and aggregate response time. If it shows zeros or errors, the slug doesn't match or the status page is set to private.
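
Since the slug is just the URL segment after /status/, you can pull it out of a copied status page URL rather than retyping it. This is a small sketch; status_page_slug is a hypothetical helper name, and the example URL reuses the host and slug from the config above.

```python
from urllib.parse import urlparse


def status_page_slug(status_url: str) -> str:
    """Return the path segment that follows /status/ in a Kuma status page URL."""
    segments = [s for s in urlparse(status_url).path.split("/") if s]
    if len(segments) < 2 or segments[-2] != "status":
        raise ValueError(f"not a status page URL: {status_url}")
    return segments[-1]


print(status_page_slug("http://192.168.1.10:3001/status/homelab"))  # homelab
```

Anything that is not a /status/ URL raises ValueError, which catches the common mistake of pasting a monitor or dashboard link instead.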

Connecting Homepage to Ollama via Custom API Widget

There is no first-party Ollama widget in Homepage as of early 2026. The right move is the customapi widget type, which does a GET against any JSON endpoint and lets you extract fields with a selector. Ollama's /api/tags endpoint returns a JSON object with a models array — one entry per model currently pulled. You can surface the count like this:

- AI:
    - Ollama:
        href: http://192.168.1.10:11434
        description: "Local LLM runtime"
        icon: ollama.png
        widget:
          type: customapi
          url: http://ollama:11434/api/tags
          refreshInterval: 30000
          # milliseconds — 30s is fine, model list doesn't change constantly
          mappings:
            - field:
                models: length
              label: Models Loaded
              format: number

The field block navigates the JSON response. models: length tells Homepage to take the models array and return its length rather than trying to display the array itself. If you want to go deeper — say, show the name of the first loaded model — you'd use models: 0 as a nested accessor and then name under it, but the length display is the most useful at a glance. The refreshInterval is in milliseconds; the default without it is around 10 seconds, which hammers Ollama's API needlessly.
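
To make the accessor semantics concrete, here is a rough Python model of how a nested field mapping walks a JSON response. It illustrates the behavior described above and is not Homepage's actual implementation; resolve_field and the sample response (including the model names) are made up for the example.

```python
def resolve_field(field, data):
    """Walk a nested accessor such as {"models": "length"} through parsed JSON."""
    value = data
    # Each mapping level names one key (or list index) to descend into.
    while isinstance(field, dict):
        key, field = next(iter(field.items()))
        value = value[key]
    # The innermost accessor is either the pseudo-key "length" or a final key/index.
    if field == "length" and isinstance(value, (list, str)):
        return len(value)
    return value[field]


# Trimmed, made-up stand-in for an /api/tags response.
response = {"models": [{"name": "llama3:8b"}, {"name": "mistral:7b"}]}

print(resolve_field({"models": "length"}, response))     # 2
print(resolve_field({"models": {0: "name"}}, response))  # llama3:8b
```

The second call corresponds to the "first model name" case: descend into models, then index 0, then read name.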

Environment Variable Substitution

Homepage supports {{HOMEPAGE_VAR_*}} substitution in all YAML config files. Any variable prefixed exactly with HOMEPAGE_VAR_ in the environment gets injected at parse time. This keeps API keys out of committed config. The pattern in practice:

# .env file alongside your docker-compose.yaml
HOMEPAGE_VAR_PORTAINER_KEY=ptr_abc123yourkeyhere
HOMEPAGE_VAR_UPTIME_KUMA_TOKEN=uk1_notneededforthiswidgetbutshownforpattern

# docker-compose.yaml snippet — the volume mount is required
services:
  homepage:
    image: ghcr.io/gethomepage/homepage:v0.9.2  # pinned, same tag as the main compose file
    env_file:
      - .env
    volumes:
      - ./config:/app/config
      # Homepage reads HOMEPAGE_VAR_* from the process environment,
      # NOT from a mounted .env — the env_file directive loads them
      # into the container's environment at startup

A common failure mode here: people mount the .env file as a volume file instead of using env_file, then wonder why substitution doesn't work. Homepage's variable substitution reads from process environment variables, not from files on disk. The env_file key in Compose handles the injection correctly. Also worth knowing: if the variable is missing at startup, Homepage renders the literal string {{HOMEPAGE_VAR_PORTAINER_KEY}} in the config — which will cause the widget to fail silently with an auth error rather than a parse error, so check your container logs rather than the browser console when debugging.
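
The missing-variable behavior is easy to reason about with a small model of the substitution step. The sketch below is an approximation of the behavior described here, not Homepage's code; substitute and unresolved are hypothetical names, and the regex assumes variable names use only letters, digits, and underscores.

```python
import re

# Matches {{HOMEPAGE_VAR_SOMETHING}} placeholders.
PLACEHOLDER = re.compile(r"\{\{(HOMEPAGE_VAR_[A-Za-z0-9_]+)\}\}")


def substitute(config_text: str, environ: dict) -> str:
    """Replace placeholders from environ; unknown names stay as literal text."""
    def replace(match):
        return environ.get(match.group(1), match.group(0))
    return PLACEHOLDER.sub(replace, config_text)


def unresolved(config_text: str, environ: dict) -> list[str]:
    """List placeholder names that have no matching environment variable."""
    names = PLACEHOLDER.findall(config_text)
    return sorted({n for n in names if n not in environ})
```

Running unresolved() over your config files with os.environ (or the parsed contents of your .env) before a restart is a cheap way to spot a typo in a variable name, since the widget itself will only report an auth failure.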

Putting It Behind a Reverse Proxy with Auth

Homepage ships with zero authentication. That's not a criticism; it's a deliberate design choice that puts the auth problem exactly where it belongs: at the proxy layer. The problem is that a lot of people skip that step because it's on a "trusted" home network. Your LAN is not as flat as you think. Any device on it (a smart TV, a guest phone, a compromised IoT gadget) can hit an unauthenticated Homepage instance and immediately read your entire service topology: every service name, internal URL, and live widget value. Homepage's documentation says widget API calls are proxied server-side so that keys stay out of the browser, which means the tokens themselves should not show up in page source. That protection only covers the keys, though. An unauthenticated visitor can still see everything the widgets display, and the dashboard hands them a map of where Portainer, Grafana, and Sonarr live on your network.

Nginx Proxy Manager is the path of least resistance here. The proxy host setup for Homepage is standard except for one toggle that catches people off guard: WebSockets Support must be enabled. Homepage uses WebSocket connections to stream live widget data — CPU graphs, download speeds, container states. Without it, the page loads but widgets silently fail to update and you end up staring at stale numbers with no error message explaining why. Set the proxy host to forward to your Homepage container on port 3000, flip the WebSockets toggle, and set your custom locations to /. For basic protection without a full SSO stack, NPM's Access List feature lets you allowlist specific IP ranges — useful if you want to lock the dashboard to a single machine's IP or a narrow subnet:

# NPM Access List entry — blocks everything except your admin workstation
# and the local subnet you actually use for management
Allow: 192.168.1.50    # your main workstation
Allow: 10.0.0.0/24    # management VLAN if you have one
Deny: all
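
To see exactly which clients that list admits, here is the same rule set modeled with Python's ipaddress module. It is only an illustration of the matching logic, not how NPM evaluates access lists internally.

```python
import ipaddress

# Same two entries as the access list above.
ALLOWED = [
    ipaddress.ip_network("192.168.1.50/32"),
    ipaddress.ip_network("10.0.0.0/24"),
]


def is_allowed(client_ip: str) -> bool:
    """True if the client address falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED)


print(is_allowed("192.168.1.50"))  # True: the admin workstation
print(is_allowed("10.0.0.17"))     # True: inside the management VLAN
print(is_allowed("192.168.1.51"))  # False: same LAN, different device
```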

IP allowlisting is a speed bump, not a wall — it fails the moment you're on the wrong device or VLAN. For real auth, Authelia and Authentik both work cleanly with Homepage, and the integration requires exactly zero Homepage-side configuration. Both systems operate as forward-auth middleware at the proxy level. In Nginx Proxy Manager with Authelia, you add a single snippet to the Advanced tab of the proxy host:

# NPM Advanced tab — Authelia forward auth for Homepage
# Authelia must be reachable at this internal URL from the proxy container
auth_request /authelia;
auth_request_set $target_url $scheme://$http_host$request_uri;
error_page 401 =302 https://auth.home.arpa/?rd=$target_url;

location = /authelia {
    internal;
    proxy_pass http://authelia:9091/api/verify;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-URL $scheme://$http_host$request_uri;
    proxy_set_header X-Forwarded-Method $request_method;
}

One thing worth being explicit about: neither Authelia nor Authentik gives you per-user views inside Homepage. The dashboard has no user model. Auth here means "authenticated users see everything or nothing" — there's no concept of showing the media stack to one user and the infrastructure widgets to another. If that's a hard requirement, you'd need to run separate Homepage instances with separate configs. Most home lab operators don't need that granularity, but it's a gap to know about before you build around it.

Completing the setup properly means getting TLS onto an internal domain without opening port 80 to the internet. The correct approach is a DNS-01 ACME challenge. If your domain is managed through Cloudflare, NPM has native Cloudflare DNS challenge support: you provide an API token scoped to DNS edit permissions for the target zone, and NPM requests a wildcard cert from Let's Encrypt entirely over DNS. No inbound HTTP required. Use a real registered domain with an internal subdomain, or a home.arpa subdomain if you're comfortable managing your own CA for it. home.arpa is the special-use domain that RFC 8375 reserves for home networks, so a name like dashboard.home.arpa is safe to use internally, but no public CA, Let's Encrypt included, will issue a certificate for it. In practice most operators use something like dashboard.yourdomain.com with a private A record pointing to an internal IP. The DNS-01 flow handles cert issuance and renewal without the domain ever needing to be publicly reachable, and you end up with a valid browser-trusted cert on a dashboard that never leaves your network.

Non-Obvious Behaviors That Will Cost You Time

The most expensive mistake you'll make with Homepage isn't misconfiguring a widget — it's trusting the UI to tell you something went wrong. It won't. Drop a tab character into a YAML block, nest a widget section at the wrong depth, forget a space after a colon — Homepage silently discards the entire widget block and renders the card as a plain label with no data. The page looks fine. Nothing is red. The only way to catch it is docker logs homepage after every config change, not just a browser refresh. The parser errors do show up in the container logs, just not anywhere a user would think to look first.

# After any services.yaml or widgets.yaml edit, run this immediately
docker logs homepage --tail=50 2>&1 | grep -i "error\|warn\|failed\|parse"

# If you see something like:
# Error: bad indentation of a mapping entry at line 14, column 5
# that's your discarded widget block — the card above it in the UI looks fine
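
Tabs in indentation are the one mistake from that list you can catch before Homepage ever sees the file, because YAML does not allow tab characters for indentation. A minimal pre-check, assuming nothing beyond the standard library (find_tab_indents is a hypothetical name):

```python
def find_tab_indents(yaml_text: str) -> list[int]:
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


sample = "- Automation:\n    - n8n:\n\t\thref: http://192.168.1.10:5678\n"
print(find_tab_indents(sample))  # [3]
```

This does not replace reading the container logs. It only rules out one class of error and will not notice wrong nesting depth or a missing space after a colon.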

For most built-in widgets, the polling interval is fixed at approximately 30 seconds and is not adjustable on a per-widget basis in the current release; customapi, with its refreshInterval option, is the exception. This is a deliberate architectural choice, not an oversight, and there's an open GitHub discussion tracking the demand for configurable intervals (github.com/gethomepage/homepage/discussions/1599). If you're building a panel to watch a transcoding queue or monitor replication lag, Homepage will feel sluggish: the numbers on screen can be up to 30 seconds old and nothing tells you so. For anything that needs fresher data than that, route the widget to Grafana or a custom status page. Homepage is the right tool for ambient awareness, not operational monitoring.

Icon resolution has a specific lookup order that bites air-gapped installs hard: Homepage checks your local /icons/ mount first, then falls back to the dashboard-icons CDN at https://cdn.jsdelivr.net/gh/walkxcode/dashboard-icons. If your instance sits behind a strict egress firewall or has no outbound access at all, every icon that isn't locally present silently 404s — no broken image indicator, just an empty space where the icon should be. The fix is to pre-populate the local mount before deployment. The dashboard-icons repo is public and mirrors cleanly.

# Clone the icon pack locally and mount it into the container
git clone --depth=1 https://github.com/walkxcode/dashboard-icons.git /opt/homepage/icons

# In your docker-compose.yml:
volumes:
  - /opt/homepage/icons/png:/app/public/icons  # Homepage expects flat PNGs here

# Reference in services.yaml:
- Sonarr:
    icon: sonarr.png  # resolves from local mount, never hits CDN

The Docker label auto-discovery capitalisation trap is subtle and will produce confusing duplicate groups that look intentional. The homepage.group label is case-sensitive, so Media and media produce two separate sidebar groups — Homepage doesn't normalize them. If you're running more than a handful of Compose stacks written at different times, drift happens. The only real fix is a convention enforced at the Compose file level from the start: pick Title Case or lowercase and document it in a comment at the top of every stack file. Once you have duplicate groups in the UI, you can't merge them from the dashboard — you have to hunt down every label mismatch across all your Compose files manually.

# Enforce this pattern across all Compose stacks — pick one, never mix
labels:
  homepage.group: Media          # Title Case — pick this OR lowercase, not both
  homepage.name: Sonarr
  homepage.icon: sonarr.png
  homepage.href: http://sonarr:8989

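# Quick audit command to surface capitalisation drift across stacks
# (handles both "homepage.group: Media" and "homepage.group=Media" label styles):
grep -rhE 'homepage\.group[:=]' /opt/stacks/ | sed -E 's/.*homepage\.group[:=] *//; s/ *#.*//; s/ +$//' | tr -d "\"'" | sort | uniq -c
# Two rows that differ only by case (Media vs media) = a split group in your UI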
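
If you want the audit to report only real conflicts instead of eyeballing sorted output, a short Python sketch can group label values case-insensitively. find_case_drift and the regex are my own; the function takes the text of each Compose file, so how you collect those files is left to you.

```python
import re
from collections import defaultdict

# Accepts both "homepage.group: Media" and "homepage.group=Media" label styles.
GROUP_LABEL = re.compile(r"homepage\.group\s*[:=]\s*['\"]?([^'\"#\n]+)")


def find_case_drift(compose_texts: list[str]) -> dict[str, list[str]]:
    """Map each lowercased group name to its spellings, keeping only conflicts."""
    spellings = defaultdict(set)
    for text in compose_texts:
        for raw in GROUP_LABEL.findall(text):
            name = raw.strip()
            spellings[name.casefold()].add(name)
    return {key: sorted(names) for key, names in spellings.items() if len(names) > 1}
```

An empty result means every group name is spelled consistently. Any entry in the result lists the competing spellings you need to reconcile.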

When Homepage Earns Its Place (and When It Doesn't)

The clearest signal that Homepage fits your setup is when you've survived a host rebuild and realized how much time you wasted re-entering service URLs, icons, and widget configs by hand. If your dashboard state lives in a browser's local storage or a SQLite file you forgot to back up, the next disk failure takes everything with it. Homepage stores every piece of configuration — services, bookmarks, widget credentials, layout — in plain YAML files that you can commit to a private Git repo and never think about again. That's not a feature worth burying in a bullet list; it's the architectural decision that makes everything else about Homepage worth tolerating.

Homepage makes sense as your primary dashboard when these conditions are true for your setup:

Config portability matters more than features: All state lives in services.yaml, bookmarks.yaml, docker.yaml, widgets.yaml, and settings.yaml. Version-control those five files and a full restore is literally git clone followed by docker compose up. No re-entering 30 service URLs, no lost icon customizations.
You want native integrations without glue code: Homepage ships with built-in widgets for Sonarr, Radarr, Jellyfin, Proxmox, Portainer, Nextcloud, AdGuard Home, UniFi, and a few dozen others. Each one talks directly to the service's API using credentials you drop into services.yaml. The alternative is building API polling yourself, which you don't want to do for 15 different services.
You're the only operator: Homepage has no concept of users, roles, or visibility rules. It renders the same dashboard to anyone who can reach the URL. For a single person running a home lab behind Tailscale or a VPN, that's fine — it's actually a simplification. For anything else, it's a hard wall.

The situations where Homepage actively gets in your way are just as specific. If you need per-user views (say, a family member should see Jellyfin and nothing else while you see the whole stack), Homepage cannot do that. Grafana with its folder permissions and dashboard-level auth is the better fit there. If you need real-time metric graphs with sub-second resolution, Homepage's widget polling (seconds at best, and only configurable on some widget types such as customapi) will feel sluggish and incomplete compared to a Prometheus scrape feeding into Grafana panels. And if your goal is to put an authentication wall in front of all your services from one place, that's Authelia or Authentik's job: Homepage has no SSO, no OAuth proxy, no session management. Treating it as an auth layer is a mistake that'll bite you the moment you expose anything to the public internet.

The config-as-code payoff is concrete enough to show directly. My full working dashboard state is five files totaling under 300 lines of YAML, sitting in a private repo. After a fresh Debian install on a replacement drive, the restore sequence is:

git clone git@github.com:youruser/homelab-configs.git ~/configs
cd ~/configs/homepage
docker compose up -d

That's it. Every service tile, every widget credential, every custom icon path, every group label — back in under two minutes, with zero manual re-entry. The compose file mounts the cloned directory directly into the container, so there's no import step and no database to restore. Compare that to any dashboard that stores config in a browser session or an embedded database you have to remember to snapshot, and the trade-off becomes obvious.

Originally published on techdigestor.com.

Speakr v0.8.19: What Actually Changed and Whether It's Worth Upgrading Your Self-Hosted Transcription Stack

우병수 — Fri, 03 Jul 2026 08:10:10 +0000

TL;DR: Cloud transcription pricing has a nasty compounding effect: you don't notice it until you're processing a backlog of three-hour interview recordings and the monthly invoice arrives.

📖 Reading time: ~18 min

What's in this article

The Problem Speakr Is Solving (And Who It's Actually For)
What v0.8.19 Actually Changed
Setup Walkthrough: Docker Compose with GPU Passthrough
Real Resource Costs on a 32GB VRAM Workstation
Connecting Speakr to n8n: Webhook Pipeline and the Timestamp Bug
Three Non-Obvious Behaviors Worth Knowing
When to Use Speakr vs. Alternatives

The Problem Speakr Is Solving (And Who It's Actually For)

Cloud transcription pricing has a nasty compounding effect: you don't notice it until you're processing a backlog of three-hour interview recordings and the monthly invoice arrives. AssemblyAI and Deepgram both bill per audio-minute, which sounds reasonable until your podcast archive hits four figures in hours, or you're running a legal firm that transcribes depositions daily. The per-minute cost isn't the only concern — both services retain audio and transcripts on their infrastructure by default, which makes them non-starters for anything covered by attorney-client privilege, HIPAA-adjacent workflows, or simply recordings that contain information you'd rather not hand to a third party's training pipeline.

Speakr sits at a specific point in the self-hosting stack: it's not raw Whisper (which you'd have to wire up yourself), and it's not a full enterprise transcription suite. What it actually gives you is a REST API in front of Whisper or faster-whisper, a persistent job queue so uploads don't block, speaker diarization hooks via pyannote.audio, and a minimal web UI for one-off uploads. That's the practical sweet spot — enough infrastructure to integrate into automation pipelines without requiring you to write the job management layer yourself. The v0.8.19 update tightens the faster-whisper integration specifically, which matters because faster-whisper's CTranslate2 backend runs noticeably leaner on VRAM than stock Whisper for the same model size.

The operators who get the most out of this are running one of a few specific workflows: interview archives where a journalist or researcher needs searchable transcripts without sending recordings to a cloud service; internal meeting pipelines where the audio never leaves the LAN; podcast post-production where chapter markers and speaker labels feed downstream into editing tools; or medical and legal shops where data residency isn't optional. If your audio is ephemeral and non-sensitive and you're transcribing a few hours a month, paying Deepgram's per-minute rate is probably fine. If any of those conditions don't hold, you're paying for a constraint you don't need.

The diarization angle is worth calling out separately. Most self-hosted Whisper wrappers skip it or bolt it on poorly. Speakr exposes diarization as a job parameter rather than a post-processing afterthought, which means speaker labels come back in the same job response as the transcript rather than requiring a second API call or manual merge. For interview recordings with two or more speakers, that single design decision is what makes the output actually usable without additional scripting. For a broader look at how transcription fits into local automation pipelines, the guide on Workflow Automation in 2026: n8n, Zapier, and Self-Hosted Pipelines is worth a read.

What v0.8.19 Actually Changed

The most dangerous change in v0.8.19 isn't the one getting the most attention. The webhook payload schema quietly flipped segments[].start (and segments[].end) from milliseconds-as-integer to seconds-as-float. No deprecation warning, no migration guide in the changelog. If you have any downstream parser (an n8n HTTP node, a custom subtitle renderer, anything that reads those timestamps), it will continue to run without error and produce timestamps that are off by a factor of 1000. A segment that used to arrive as 500 now arrives as 0.5, and a parser that still treats the value as milliseconds places it at 0.0005 seconds. The result is subtitle files where every cue fires within the first few seconds of playback. Check your parsers before upgrading if you have production flows depending on segment timing.
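
One way to protect a downstream parser is to normalize timestamps at the boundary, with the unit stated explicitly. The sketch below assumes only the segments[].start and segments[].end fields mentioned above; the function name and the text field in the test data are hypothetical.

```python
def segments_to_ms(segments: list[dict], unit: str) -> list[dict]:
    """Return copies of the segments with start/end as integer milliseconds.

    unit is "ms" for the old payload shape and "s" for the new one.
    """
    if unit not in ("ms", "s"):
        raise ValueError(f"unknown unit: {unit}")
    factor = 1000 if unit == "s" else 1
    converted = []
    for segment in segments:
        fixed = dict(segment)  # shallow copy so the caller's payload is untouched
        fixed["start"] = round(segment["start"] * factor)
        fixed["end"] = round(segment["end"] * factor)
        converted.append(fixed)
    return converted
```

Passing the unit explicitly (for example, keyed off the Speakr version you deploy) is deliberate. Guessing from the magnitude of the number is fragile, because a small integer like 2 could plausibly be either 2 ms or 2 s.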

The switch to faster-whisper as the default backend is a genuine win for cold-start latency, but the changelog undersells what it means for VRAM. faster-whisper uses CTranslate2 under the hood, which loads weights in a more compressed format. On my 32GB box, large-v3 via the original OpenAI Whisper backend was sitting around 10GB VRAM resident after the first job. With faster-whisper as the default, the same model loads closer to 6.5GB and first-token latency on a 5-minute audio file drops noticeably. The trade-off: the CTranslate2 build adds a native dependency, and if you're running a stripped-down container image, your first deploy after upgrading may fail with a missing shared library error rather than a clean startup message.

The new SPEAKR_BATCH_CONCURRENCY env var is one of those additions that sounds boring until you've had a GPU OOM-kill a transcription halfway through a 90-minute file. Previously, if three jobs hit the queue simultaneously, Speakr would spin up three parallel GPU contexts with no backpressure. On anything under 24GB VRAM, that's a crash waiting to happen. Now you can actually enforce a ceiling:

# docker-compose.yml fragment
environment:
  SPEAKR_BATCH_CONCURRENCY: "2"   # hard cap at 2 parallel GPU jobs
  SPEAKR_QUEUE_TIMEOUT: "300"     # seconds before a queued job is dropped
  SPEAKR_BACKEND: "faster-whisper" # now the default, but explicit is better

Setting this to 1 is the safest starting point on a shared workstation. Set it to 2 only if you've confirmed your model footprint leaves headroom — faster-whisper's lower VRAM floor makes this more viable than it was before. The queue will block and wait rather than fork a new process, so throughput goes down but reliability goes up. Worth it.
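
The block-and-wait behavior is the classic semaphore pattern. The sketch below is a toy model of a concurrency cap, not Speakr's implementation: run_jobs is a made-up function, and the sleep stands in for GPU work.

```python
import threading
import time


def run_jobs(job_count: int, concurrency: int) -> int:
    """Run job_count fake jobs; return the highest number that ran at once."""
    slots = threading.Semaphore(concurrency)
    lock = threading.Lock()
    running = 0
    peak = 0

    def job():
        nonlocal running, peak
        with slots:  # blocks here while all slots are taken
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.05)  # stand-in for GPU work
            with lock:
                running -= 1

    threads = [threading.Thread(target=job) for _ in range(job_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak
```

However many jobs arrive, the returned peak never exceeds the cap. That is the property you are buying with the lower throughput.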

Word-level timestamps being on by default (--word_timestamps true) is the right call for most real use cases, but you should know what you're paying for. The latency overhead is real — budget roughly 15% extra wall-clock time on large-v3, more on longer files because the alignment pass scales with audio duration. If you're running batch overnight jobs where timestamp granularity doesn't matter, you can explicitly disable it to recover that time. But for anyone building subtitle pipelines, diarization prep, or anything that needs to sync text to a specific frame, having this on by default means your output is actually usable without a second post-processing pass. The segment-level timestamps were never precise enough for subtitle work; word-level timestamps are the minimum viable granularity for that use case.

Setup Walkthrough: Docker Compose with GPU Passthrough

The part that bites most people first isn't the GPU config — it's that Speakr v0.8.19 will happily pull a 3GB model file at container startup if it doesn't find the expected cache layout inside /models. That download blocks the health-check endpoint, Docker marks the container unhealthy, and your orchestrator kills and restarts it before transcription ever runs. Pre-stage the model first, configure second.

Here's a minimum viable docker-compose.yml that pins the image, passes the NVIDIA runtime through, and mounts both required volumes:

services:
  speakr:
    image: ghcr.io/speakr-oss/speakr:0.8.19   # pin — latest breaks on model path changes
    runtime: nvidia
    restart: unless-stopped
    volumes:
      - /mnt/models/speakr:/models             # pre-staged model files live here
      - /mnt/speakr-queue:/queue               # audio queue; survives container restarts
    environment:
      SPEAKR_MODEL: large-v3                   # which Whisper checkpoint to load
      SPEAKR_DEVICE: cuda                      # "cpu" is a valid fallback, ~8x slower
      SPEAKR_BATCH_CONCURRENCY: 2              # parallel decode slots; 1 if sharing GPU
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: compute,utility
    ports:
      - "8765:8765"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8765/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s   # give it time if model IS loading cold — longer than default
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

SPEAKR_BATCH_CONCURRENCY is the variable that matters most on a shared-GPU box. On my 32GB VRAM workstation running Ollama alongside Speakr, I keep this at 1 during Ollama's active window and bump it to 2 overnight. At 3, large-v3 plus a loaded Ollama model will OOM without warning. The model selection directly determines how much headroom you have:

large-v3 — roughly 10GB VRAM resident once loaded. Best word-error rate on accented speech, technical vocabulary, mixed-language audio. Use this if Speakr is your primary GPU tenant.
medium.en — roughly 5GB VRAM. English-only, noticeably faster per-file, WER degrades on non-native speakers and proper nouns. Good middle ground when sharing with a 7B Ollama model.
small — roughly 2GB VRAM. Latency is fast enough to feel real-time on short clips. WER on anything with background noise or strong accents is bad enough to require a post-processing correction pass if accuracy matters.

To avoid the startup-download failure, pre-stage the model using huggingface-cli directly into your mounted path before the container ever starts:

# install once if you don't have it
pip install huggingface_hub

# download large-v3 into the exact directory Speakr expects
huggingface-cli download \
  openai/whisper-large-v3 \
  --local-dir /mnt/models/speakr/whisper-large-v3 \
  --local-dir-use-symlinks False

# verify the weights file is there — Speakr checks for model.safetensors
ls -lh /mnt/models/speakr/whisper-large-v3/model.safetensors

The --local-dir-use-symlinks False flag matters: Hugging Face's default symlink layout confuses Speakr's model loader on first boot, and the error message it gives back ("model not found, downloading") is misleading — the files are there, just not where it looks first.

One more failure mode that shows up specifically on long audio files: Speakr's upload endpoint accepts the file fine, but the transcription takes longer than your reverse proxy's read timeout. The proxy closes the connection, the client gets a 504, and the transcription finishes successfully in the container — invisibly, with no way to retrieve the result through the normal response path. In Nginx, set this on the location block handling the upload:

location /api/transcribe {
    proxy_pass          http://127.0.0.1:8765;
    proxy_read_timeout  300s;   # 5 min — covers ~45min audio on large-v3
    proxy_send_timeout  120s;
    client_max_body_size 512M;  # Speakr's default upload limit; match this
}

In Caddy the equivalent is reverse_proxy localhost:8765 { transport http { read_timeout 5m } }. Traefik has no timeout middleware; forwardAuth options such as authResponseHeadersRegex only deal with authentication headers and do not affect timeouts. Set readTimeout under transport.respondingTimeouts on the entrypoint itself instead, which also avoids having to add labels to every container that shares that entrypoint.

Real Resource Costs on a 32GB VRAM Workstation

The VRAM split between Whisper and your LLM is the first thing you need to model before committing to a configuration. On my 32GB workstation, large-v3 via faster-whisper holds roughly 10GB VRAM resident while a transcription job is active — not allocated at startup, but from the moment the first audio chunk hits the model. That leaves ~22GB for Ollama, which is workable for a 13B or 34B quant as long as you're not triggering both simultaneously. The failure mode is subtle: if an n8n flow kicks off a transcription job while an LLM inference request is mid-generation, you'll hit fragmented VRAM pressure rather than a clean OOM — the transcription stalls or the LLM inference slows to a crawl without an obvious error. The fix is serializing at the orchestration layer, not the model layer.
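
If the orchestration happens in your own code rather than in n8n, serializing can be as small as one shared lock around every GPU-bound call. The sketch below is only an illustration: gpu_gate, run_gpu_job, and the sleep standing in for the real transcription or LLM request are all hypothetical names, not part of Speakr or Ollama.

```python
import threading
import time

# Hypothetical gate: one lock shared by every code path that touches the GPU.
gpu_gate = threading.Lock()
active = 0        # jobs currently inside the gate
max_active = 0    # highest overlap observed, for demonstration only
order = []        # names of jobs in the order they got the GPU


def run_gpu_job(name: str, seconds: float) -> None:
    """Run one GPU-bound job while holding the gate."""
    global active, max_active
    with gpu_gate:
        active += 1
        max_active = max(max_active, active)
        order.append(name)
        time.sleep(seconds)  # stand-in for the transcription or LLM call
        active -= 1


jobs = [
    threading.Thread(target=run_gpu_job, args=("transcribe", 0.05)),
    threading.Thread(target=run_gpu_job, args=("summarize", 0.05)),
]
for t in jobs:
    t.start()
for t in jobs:
    t.join()

print("max concurrent GPU jobs:", max_active)  # 1
```

The second job simply waits for the first to release the lock, which trades a little latency for predictable VRAM use.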

If your workload actually needs concurrent transcription rather than LLM coexistence, the concurrency math works out cleaner with a smaller model. medium.en runs at roughly 5GB per worker, so:

# docker-compose or .env
SPEAKR_MODEL=medium.en
SPEAKR_BATCH_CONCURRENCY=2
# Total Speakr VRAM footprint: ~10-11GB
# Leaves ~21GB for a 7B Q4_K_M quant in Ollama (~4.5GB) with room to breathe

Two medium.en workers plus a 7B quantized model fits cleanly without serialization constraints. The accuracy delta between medium.en and large-v3 is noticeable on accented speech and domain-specific vocabulary, but for English interview recordings or meeting audio with clear speakers, medium.en is good enough that you won't fight the output downstream.
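
The budget arithmetic is easy to keep honest with a few lines of code. This is a rough sketch using the approximate figures quoted above; it ignores CUDA context overhead and the LLM's KV cache, so treat the result as an upper bound on real headroom. The function name is made up for illustration.

```python
def vram_headroom_gb(total_gb: float, whisper_gb_per_worker: float,
                     workers: int, llm_gb: float) -> float:
    """VRAM left over once the Whisper workers and the LLM are resident."""
    used = whisper_gb_per_worker * workers + llm_gb
    return total_gb - used


# 32GB card, two medium.en workers at ~5GB each, 7B Q4_K_M at ~4.5GB
print(vram_headroom_gb(32, 5, 2, 4.5))   # 17.5

# Same card with a single large-v3 worker (~10GB) and no LLM loaded
print(vram_headroom_gb(32, 10, 1, 0))    # 22
```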

CPU fallback is more useful than it sounds, but only for specific scheduling patterns. Setting SPEAKR_DEVICE=cpu zeroes out VRAM usage entirely — the full GPU is free for LLM work around the clock. The cost is real: a 1-hour audio file that finishes in roughly 4 minutes on GPU takes ~35 minutes on CPU with large-v3. For overnight batch jobs where a PM2 cron queues up everything recorded during the day, that latency is irrelevant. For anything interactive — a user uploads a file and waits for a transcript — it's not usable. The right pattern is a configurable SPEAKR_DEVICE per deployment profile rather than a single value baked into your compose file.

Disk is the resource most people forget to track until the model cache directory is suddenly 15GB. large-v3 alone sits at ~3GB on disk. The sharp edge in Speakr's current behavior: changing SPEAKR_MODEL in your env and restarting pulls the new model but does not clean up the old one. After a week of experimentation across four or five model sizes, the cache directory accumulates all of them silently.

# Find your cache mount and audit it
docker exec speakr_app du -sh /root/.cache/huggingface/hub/models--Systran*/
# or wherever SPEAKR_MODEL_CACHE_DIR points in your compose

# Manual prune example — remove a specific model revision
rm -rf /data/speakr-cache/models--Systran--faster-whisper-medium/

There's no speakr prune command in v0.8.19 — pruning is entirely manual. If you're on a volume with a hard size limit, add a cleanup step to your deployment script whenever you change SPEAKR_MODEL, otherwise that volume silently fills and the next model pull fails mid-download with an unhelpful I/O error rather than a disk-space warning.
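
A cleanup step along those lines might look like the following sketch. It assumes the cache uses the Hugging Face hub layout shown above (one models--* directory per model); the function name, the keep argument, and the dry-run default are my own choices, not anything Speakr provides.

```python
import shutil
from pathlib import Path


def prune_model_cache(cache_dir: str, keep: str, dry_run: bool = True) -> list[str]:
    """Remove every models--* directory except the one whose name ends with `keep`.

    Returns the names of the directories that were (or, in dry-run mode,
    would be) removed.
    """
    removed = []
    for entry in sorted(Path(cache_dir).glob("models--*")):
        if not entry.is_dir() or entry.name.endswith(keep):
            continue
        removed.append(entry.name)
        if not dry_run:
            shutil.rmtree(entry)
    return removed
```

Calling prune_model_cache("/data/speakr-cache", keep="faster-whisper-large-v3") first, with the default dry_run=True, lists what would go before anything is deleted.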

Connecting Speakr to n8n: Webhook Pipeline and the Timestamp Bug

The timestamp unit change in v0.8.19 is the kind of silent breakage that produces no errors, just subtitles that are off by a factor of a thousand. Speakr used to return milliseconds; this version returns seconds. If you had a working pipeline before this update, the segment.start / 1000 conversion in any downstream Function node now divides a value that is already in seconds. The output still looks like a valid float, the workflow runs green, and your SRT file has timestamps like 00:00:00,010 where the audio says something at the ten-second mark. Delete the division entirely and treat the value as seconds from the start.
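
To see the effect in isolation, here is a small sketch of an SRT timestamp formatter fed with a seconds value, once directly and once through a stale ms-to-seconds division. The srt_timestamp helper is hypothetical; only the unit change itself comes from the behavior described above.

```python
def srt_timestamp(seconds: float) -> str:
    """Format an offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


start = 10.0  # seconds, the unit v0.8.19 returns

print(srt_timestamp(start))          # 00:00:10,000  (value used as-is)
print(srt_timestamp(start / 1000))   # 00:00:00,010  (stale division left in place)
```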

The basic wiring is straightforward: an n8n HTTP Request node POSTs a binary audio file to /api/v1/jobs, Speakr returns a job ID immediately, and then you choose between polling and webhooks. Polling is simpler to debug but burns time on long files. The webhook path is cleaner — register a callback URL on the job creation POST, and Speakr will fire a POST back to an n8n Webhook node when processing finishes. The critical gotcha here is that Speakr fires on both completion and failure, and the payload shape differs. A failed job still hits your webhook endpoint with a job_status: "failed" field and no segments array. If your downstream nodes assume segments exists, they'll throw a runtime error on any failed transcription job. The fix is a single IF node or a Function node guard before anything touches the segments array:

// n8n Function node — drop this before any segment processing
const payload = $input.first().json;

if (payload.job_status !== 'completed') {
  // surface the failure clearly rather than letting it blow up downstream
  throw new Error(`Speakr job failed: ${payload.job_id} — status: ${payload.job_status}`);
}

return $input.all();

On my own stack, the workflow starts with a folder watch — a Node.js script running under PM2 monitors a drop directory and triggers the n8n workflow via its own webhook endpoint when a new audio file lands. The n8n flow calls Speakr, waits for the completion webhook, then immediately pipes the segments array into a second HTTP Request node calling the local Ollama API (http://localhost:11434/api/generate) with the full transcript text and a summarization prompt. The summarization model I use for this is Mistral 7B — fast enough that the whole pipeline from drop to draft is under two minutes for a 20-minute audio file. The final step is a WordPress REST API call that creates a draft post with the summary as the body and the raw transcript stuffed into a custom field for reference.

// Reconstructing plain-text transcript from segments (post-v0.8.19)
// segment.start is already in seconds — no conversion needed
const segments = $input.first().json.segments;

const transcript = segments
  .map(seg => `[${seg.start.toFixed(2)}s] ${seg.text.trim()}`)
  .join('\n');

return [{ json: { transcript } }];

One practical note on the HTTP Request node configuration for the initial job POST: set the body content type to multipart/form-data and attach the binary audio item from earlier in the workflow using the Binary Data option — not base64. Speakr's file size limits are enforced at the HTTP layer, and if you base64-encode a large WAV before posting, you'll hit the limit with a file that would have been fine as a raw binary. Also register the webhook URL with a path that includes the job ID if you're running multiple concurrent transcription workflows; otherwise all completions land on the same endpoint and you'll need to demux them inside n8n based on the payload's job ID field, which is doable but adds complexity you don't need.
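
The base64 penalty is simple arithmetic: every 3 raw bytes become 4 encoded characters, so the payload grows by about a third. A quick sketch, using the 512M limit from the Nginx example and an illustrative 400 MiB file:

```python
import base64


def base64_size(raw_bytes: int) -> int:
    """Encoded length of `raw_bytes` of input under standard padded base64."""
    return 4 * ((raw_bytes + 2) // 3)


# Sanity check of the formula against the real encoder on a small input
assert base64_size(10) == len(base64.b64encode(bytes(10)))

raw = 400 * 1024 * 1024      # a 400 MiB WAV (illustrative size)
limit = 512 * 1024 * 1024    # the 512M upload limit

print(raw <= limit)                 # True: fits as raw binary
print(base64_size(raw) <= limit)    # False: roughly 533 MiB once encoded
```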

Three Non-Obvious Behaviors Worth Knowing

The diarization support is the biggest gotcha. Speakr's documentation uses language that implies speaker separation is a built-in capability, but what it actually ships is a forwarding layer. The SPEAKR_DIARIZATION_URL environment variable is not optional configuration for a bundled feature — it's the entire implementation. If that variable isn't set and pointing at a live pyannote-audio inference endpoint, diarization silently does nothing. You don't get an error. The transcript just comes back without speaker labels and the UI gives no indication why. You have to run pyannote-audio as a separate service, expose it on HTTP, and wire the URL in. Budget another container and a model download (the pyannote speaker diarization pipeline pulls several hundred MB of weights) before you treat diarization as a working feature.

# docker-compose excerpt — pyannote sidecar wired to Speakr
  speakr:
    image: speakr:0.8.19
    environment:
      # without this, diarization UI toggle does nothing
      SPEAKR_DIARIZATION_URL: http://pyannote:8000/diarize
    depends_on:
      - pyannote

  pyannote:
    image: your-pyannote-serving-image
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/torch  # reuse weight cache across restarts

The SQLite persistence problem will bite you the first time you update the container image. By default, Speakr writes its job database to a path inside the container filesystem. Pull a new image, spin up a new container, and everything — job history, completed transcripts, any stored results — is gone. The fix is a single volume mount, but it's not called out prominently in the v0.8.19 release notes. The database path is /app/data/speakr.db. Mount that directory as a named volume or a bind mount and your job history survives restarts and image updates. Without it you're running an amnesiac service.

  speakr:
    image: speakr:0.8.19
    volumes:
      # this one line is the difference between persistent and ephemeral jobs
      - speakr_data:/app/data

volumes:
  speakr_data:

Language auto-detection is more expensive than the docs suggest. Leaving SPEAKR_LANGUAGE unset causes Speakr to run Whisper's detection pass over the first 30 seconds of each audio file before transcription starts. On longer files that's a minor tax, but on a queue of short clips it adds up fast. The more operationally annoying issue is accuracy: accented English — particularly South Asian, West African, and Australian regional accents — gets misidentified as Hindi, French, or Portuguese with enough regularity to matter. When that happens you get a transcript in the wrong language with no error surfaced, just garbage output. If your use case has a known input language, set it explicitly:

environment:
  SPEAKR_LANGUAGE: en   # ISO 639-1 code; skips detection pass entirely

The detection failure mode is particularly frustrating to debug because the job shows as completed with a non-zero confidence score — the model is confident, just confidently wrong about which language it's transcribing. Pinning the language also gives you a small but consistent latency reduction on every job, so there's no downside if your audio source is predictable.

When to Use Speakr vs. Alternatives

The honest decision point is resource allocation: Speakr earns its place when you need a managed queue in front of Whisper and don't want to wire up Redis, a job worker, a REST layer, and a status-polling endpoint yourself. That plumbing is genuinely tedious, and Speakr ships it pre-assembled. If you're running multiple services hitting the same GPU — say, an n8n flow triggering transcriptions alongside a separate ingestion pipeline — the job queue is what keeps you from blowing up VRAM with concurrent Whisper loads. That's the actual value proposition, not the web UI.

For single-pipeline work, Speakr is overhead you don't need. If you have one Python script, one cron job, or one Node process that needs transcription and nothing else will ever share that path, faster-whisper directly is cleaner:

# faster-whisper, no server, no queue, just a function call
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")

Same story with whisper.cpp — if you're embedding transcription into a Go or C++ service and want a single binary with no Python runtime dependency, whisper.cpp with its server mode is the right call. Speakr's REST API adds round-trip HTTP overhead and a process boundary that only pays off when you actually need isolation and queuing.

Cloud APIs (Deepgram, AssemblyAI) win on latency at low concurrency — full stop. A self-hosted Whisper large-v3 model on a GPU that's also running inference for other workloads will not beat Deepgram's Nova-2 turnaround time on a 5-minute file. The break-even is data locality: if you're transcribing audio that can't leave your infrastructure, or if your volume is high enough that per-minute API costs become meaningful, self-hosted makes sense. But if your SLA is "user uploads a file and sees a transcript in under 10 seconds," a loaded local GPU is a liability, not an asset.

On v0.8.19 specifically: upgrade if you were hitting OOM crashes under concurrent load — the concurrency cap that shipped in this release is a direct fix for that failure mode, not a config tweak you can replicate yourself in older versions. Also upgrade if you need word-level timestamps in your output; that feature wasn't stable before this release. Hold off if your downstream parser expects millisecond integers for timestamp values — v0.8.19 changed the timestamp format in the response payload, and anything doing parseInt(segment.start) or treating it as a raw number will break silently. Check your consumer code before you roll it out.

Originally published on techdigestor.com.

Monitoring Kubernetes Clusters with OpenTelemetry Collector: The Agent + Gateway Pattern Explained

우병수 — Wed, 01 Jul 2026 08:10:54 +0000

TL;DR: The failure mode nobody talks about until it's happening in production: every pod opens a direct gRPC connection to your Tempo or Loki ingest endpoint, and the backend starts dropping spans not because it's overloaded on CPU, but because it hits its concurrent connection limit.

📖 Reading time: ~23 min

What's in this article

The Problem: Per-Node Chaos Without a Collection Strategy
Architecture Overview: Agent DaemonSet + Gateway Deployment
Deploying the Agent: DaemonSet Config That Actually Works
Deploying the Gateway: Where Batching and Sampling Live
Three Non-Obvious Behaviors That Will Burn You
Validating the Pipeline and What to Monitor About the Collector Itself
When to Use This Pattern and When to Skip It

The Problem: Per-Node Chaos Without a Collection Strategy

The failure mode nobody talks about until it's happening in production: every pod opens a direct gRPC connection to your Tempo or Loki ingest endpoint, and the backend starts dropping spans not because it's overloaded on CPU, but because it hits its concurrent connection limit. gRPC connections aren't free — each one holds state, negotiates keepalives, and occupies a file descriptor. At five nodes with a handful of pods each, this is invisible. At twenty nodes with autoscaling workloads, you're opening hundreds of persistent connections to a single endpoint that was sized for a fraction of that. The ingest service doesn't degrade gracefully; it starts returning RESOURCE_EXHAUSTED and your traces silently disappear.

The fan-out math is straightforward and brutal. If each node runs a DaemonSet collector that connects directly to your backends, and each of those collectors opens separate OTLP gRPC connections for metrics, traces, and logs, a 20-node cluster with three signal types is already at 60 persistent upstream connections minimum, before you account for any application-level SDK exporters that also decided to phone home directly. Tempo's default ingest configuration isn't built to handle that connection count from a single cluster, and Loki's distributor will start rejecting pushes under the same pressure. The spans don't queue; they're dropped at the exporter with a transient error that most SDKs log once and discard.

A DaemonSet-only deployment looks clean on paper: one collector per node, scrape local pods, forward upstream. The problem is "forward upstream" implies the DaemonSet agent is doing two jobs simultaneously — staying lightweight enough to not steal resources from the workloads it shares a node with, and being stateful enough to buffer, batch, retry, and route signals reliably. Those are contradictory requirements. A collector trying to hold a retry queue for failed exports while also scraping Prometheus endpoints on tight intervals will either starve its scrape loop under memory pressure or drop its queue when the node evicts it. The two roles genuinely need to be separate processes with separate resource profiles.

The architecture that actually works separates these concerns by design. The agent — running as a DaemonSet — is intentionally dumb and lean: receive signals from local pods via OTLP, do minimal processing (add node/pod labels, nothing expensive), and forward to an internal gateway endpoint. The gateway — running as a Deployment with persistent storage or at least a proper memory queue — handles everything stateful: batching, retry with backoff, TLS to the external backend, fan-in from all agents into a manageable number of upstream connections. The gateway sees maybe three or four connections going out, regardless of how many nodes are in the cluster. That's the connection count your Tempo ingest was actually sized for.
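
The difference in connection counts can be sketched in a few lines. The 20 nodes and three signal types come from the example above; the gateway replica count and the number of backends are assumptions picked for illustration.

```python
def direct_connections(nodes: int, signal_types: int) -> int:
    """Every node-level collector opens one upstream connection per signal."""
    return nodes * signal_types


def gateway_connections(gateway_replicas: int, backends: int) -> int:
    """Only gateway pods talk to backends, so node count drops out entirely."""
    return gateway_replicas * backends


for nodes in (5, 20, 100):
    print(nodes, direct_connections(nodes, 3), gateway_connections(2, 3))
# 5 15 6
# 20 60 6
# 100 300 6
```

The middle column grows linearly with the cluster; the last column only changes when you add gateway replicas or backends.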

Trying to make one collector configuration do both roles causes resource contention in a specific and annoying way. The batch processor needs memory headroom proportional to the volume it's buffering. The prometheusreceiver needs CPU for scrape cycles. On a busy node, these compete. You'll see the batch processor's send queue fill up during a scrape spike, which causes backpressure into the receiver, which causes the exporter to the backend to time out, which triggers a retry loop, and now your lightweight DaemonSet pod is sitting at 400MB RAM and getting OOMKilled. The node comes back clean, the collector restarts, and you've lost whatever was in the queue. Split the roles and you size each component appropriately: agents at a small hard cap of a few hundred MiB at most, the gateway at whatever the actual buffering workload demands.

Architecture Overview: Agent DaemonSet + Gateway Deployment

The split that most people miss when they first read the OpenTelemetry Collector docs is that you're not choosing between an agent and a gateway — you're running both, for different jobs. The agent is a resource-constrained process living on every node. The gateway is a proper service that absorbs all that data and does the expensive work. Conflating them into a single deployment is the fastest way to either blow your node memory budget or lose telemetry during a backend outage.

The agent runs as a DaemonSet: one pod per node, no exceptions. Its job list is narrow on purpose: scrape kubeletstats and hostmetrics for node-level data, tail container logs via the filelog receiver, and listen on port 4317 for OTLP pushes from app containers on the same node. That last one matters: because the agent is node-local, SDK instrumentation in app pods can reach it at the node's own IP (typically a hostPort on the agent, with the address injected into app pods from status.hostIP via the downward API) without any service discovery overhead. Plain localhost only works when the app shares the agent's network namespace, because every pod otherwise has its own loopback interface. The memory ceiling for the agent should be hard-capped, something like 200–300Mi depending on log volume, because it has no persistent queue. If the gateway is unreachable, the agent drops data. That's a feature, not a bug. You do not want a DaemonSet silently accumulating gigabytes of retry buffer on every node in a partition event.

The gateway runs as a Deployment with two or more replicas behind a ClusterIP service. It receives OTLP over gRPC from every agent in the cluster, applies tail-based sampling decisions across the full trace (which requires seeing all spans for a given trace ID in one place; more on that in the sampling section), batches aggressively before forwarding, and owns all the retry queues and remote backend credentials. The gateway is where your Prometheus remote_write URL, your Tempo endpoint, and your Loki push URL live, as Kubernetes Secrets mounted into the gateway pods, not baked into a ConfigMap that every node-level ServiceAccount can read. The data flow in plain terms: your app pushes OTLP to the node-local agent on port 4317, the agent forwards over gRPC to the gateway's ClusterIP service (something like otelcol-gateway.observability.svc.cluster.local:4317), and the gateway fans out to backends. Prometheus gets metrics via remote_write, Tempo gets traces via OTLP HTTP or gRPC, and Loki gets logs via its push API.

RBAC is where this split pays off in a concrete security way. The agent ServiceAccount needs real cluster permissions because it's doing discovery work:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-agent
rules:
  - apiGroups: [""]
    # nodes/stats is what the kubeletstats receiver reads; namespaces is used by k8sattributes
    resources: ["nodes", "nodes/stats", "nodes/metrics", "pods", "namespaces", "endpoints", "services"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["extensions", "apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
  # needed when scraping /metrics and /metrics/cadvisor directly
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]

The gateway ServiceAccount needs none of that. It receives data over a network socket and pushes it to external endpoints. If someone misconfigures the gateway's collector config and opens an unintended receiver, the blast radius is zero Kubernetes API access. Keeping these two ServiceAccounts completely separate means a compromised or misconfigured gateway pod cannot enumerate your cluster topology, and a misconfigured agent pod cannot reach your remote backend credentials. That's the actual value of the split — not just resource isolation, but a meaningful reduction in what any single misconfiguration can touch.

Deploying the Agent: DaemonSet Config That Actually Works

The part most tutorials skip: the agent ConfigMap is where most production failures originate, not the gateway. A misconfigured memory_limiter processor doesn't gracefully shed load — it causes the collector pod to OOMKill and restart in a loop, taking your node-level metrics dark for 30–90 seconds per restart cycle. The processor must be listed first in every pipeline's processor chain, before batch, or it has no chance to act before memory is already blown. This is documented in the OpenTelemetry Collector docs but buried far enough that it's easy to miss on first read.

Here's a minimal but complete ConfigMap that actually runs. Real field names, real units — no placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otelcol-agent-config
  namespace: monitoring
data:
  config.yaml: |
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: "https://${env:K8S_NODE_NAME}:10250"
        insecure_skip_verify: true
        metric_groups:
          - node
          - pod
          - container

      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu: {}
          disk: {}
          filesystem:
            exclude_mount_points:
              mount_points: [/dev, /proc, /sys, /run/k3s/containerd]
              match_type: strict
          memory: {}
          network: {}
          load: {}

      filelog:
        include:
          - /var/log/pods/*/*/*.log
        include_file_path: true
        include_file_name: false
        operators:
          - type: router
            id: get-format
            routes:
              - output: parser-docker
                expr: 'body matches "^\\{"'
              - output: parser-crio
                expr: 'body matches "^[^ Z]+ "'
          - type: json_parser
            id: parser-docker
            output: extract-metadata
          - type: regex_parser
            id: parser-crio
            regex: '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
            output: extract-metadata
          - type: move
            id: extract-metadata
            from: attributes["log"]
            to: body

    processors:
      memory_limiter:
        # must be first in pipeline — limits before batch can accumulate
        check_interval: 1s
        limit_mib: 220          # hard ceiling below the 256Mi pod limit
        spike_limit_mib: 60     # absorbs bursts without hitting the hard limit

      batch:
        send_batch_size: 1024
        timeout: 5s
        send_batch_max_size: 2048

      resourcedetection:
        detectors: [env, k8snode]
        timeout: 5s

      k8sattributes:
        auth_type: serviceAccount
        passthrough: false
        filter:
          node_from_env_var: K8S_NODE_NAME
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.container.name

    exporters:
      otlp/gateway:
        endpoint: otelcol-gateway.observability.svc.cluster.local:4317   # must match the namespace the gateway is deployed in
        tls:
          insecure: true   # mTLS is a gateway-layer concern; agent→gateway is cluster-internal

    service:
      extensions: [health_check]
      pipelines:
        metrics:
          receivers: [kubeletstats, hostmetrics, otlp]
          processors: [memory_limiter, resourcedetection, k8sattributes, batch]
          exporters: [otlp/gateway]
        logs:
          receivers: [filelog, otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp/gateway]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp/gateway]

On resource limits: 200m CPU and 256Mi memory is the right starting ceiling for the agent pod, not because of any single benchmark but because of where the pain points are. CPU for the agent is almost never the bottleneck — the 200m ceiling is defensive, preventing a misbehaving scraper from starving other node workloads. Memory is where you actually hit problems. The memory_limiter is configured at 220 MiB hard limit with a 60 MiB spike allowance, which leaves a small gap below the 256Mi pod limit. That gap is intentional: if the limiter somehow fails to shed fast enough, the container OOMKills cleanly rather than thrashing. Flip those numbers — set limit_mib above your container limit — and you get the worst outcome: the kernel kills the process before the limiter ever fires.

resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
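
The relationship between the three numbers is worth checking before you deploy. The memory_limiter's soft limit is limit_mib minus spike_limit_mib, and limit_mib has to sit below the container limit. A small sketch of that check follows; the function name and the returned keys are made up for illustration.

```python
def check_memory_limiter(limit_mib: int, spike_limit_mib: int,
                         container_limit_mib: int) -> dict:
    """Derive the soft limit and verify the hard limit fits inside the pod limit."""
    soft_limit = limit_mib - spike_limit_mib
    headroom = container_limit_mib - limit_mib
    return {
        "soft_limit_mib": soft_limit,   # limiter starts refusing data here
        "headroom_mib": headroom,       # gap between hard limit and OOMKill
        "ok": soft_limit > 0 and headroom > 0,
    }


# Values from the agent config above: 220 / 60 under a 256Mi pod limit
print(check_memory_limiter(220, 60, 256))
# {'soft_limit_mib': 160, 'headroom_mib': 36, 'ok': True}

# The flipped case: limit_mib above the container limit
print(check_memory_limiter(300, 60, 256)["ok"])   # False
```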

The filelog receiver requires two hostPath mounts that don't show up in the basic install guides. Without them the receiver starts cleanly but collects nothing, and the only evidence is silence in your log pipeline:

volumeMounts:
  - name: varlogpods
    mountPath: /var/log/pods
    readOnly: true
  - name: varlibdockercontainers
    mountPath: /var/lib/docker/containers
    readOnly: true
volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers

On containerd-only nodes (k3s, most current kubeadm clusters), /var/lib/docker/containers won't exist but the mount won't fail either — it'll just be empty. The actual log files are under /var/log/pods regardless of runtime, so that mount is the critical one. The docker path is worth keeping for mixed clusters or if you're running Docker-in-Docker workloads.

The health check extension at 0.0.0.0:13133 earns its place by catching a failure mode that metrics pipelines can't see: a collector that is running, passing its readiness check, but has internally deadlocked on a slow exporter. The liveness probe should hit / on port 13133, not the OTLP port. A deadlocked collector will stop updating its internal health endpoint within the check interval, triggering a pod restart before you even notice the gap in your metrics. Without this probe, a stuck collector can look healthy to Kubernetes while silently dropping every signal it receives.

livenessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 5
  periodSeconds: 10

Deploying the Gateway: Where Batching and Sampling Live

The single most important architectural decision in this whole pattern is where you put tail sampling — and the answer is never on the agent. An agent DaemonSet pod sees spans from one node. A trace for a single user request might touch pods on three different nodes, meaning each agent sees a fragment. If you apply a tail sampling policy at the agent, you're making a keep/drop decision on an incomplete picture. The policy fires when it thinks the trace is complete, but it isn't — the spans from the other nodes are just missing. You don't get an error. You get silently incomplete traces in Tempo, and you spend an hour wondering why your database call spans vanished.
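
A toy model makes the fragment problem concrete. The spans below are hypothetical: one request that crosses three nodes and takes 700ms end to end. Each agent evaluates a 500ms latency policy against only the spans it saw, while the gateway evaluates it against the whole trace.

```python
from collections import defaultdict

THRESHOLD_MS = 500

# (trace_id, node, start_ms, end_ms): hypothetical spans for one request
spans = [
    ("t1", "node-a", 0, 120),
    ("t1", "node-b", 100, 450),
    ("t1", "node-c", 430, 700),
]


def trace_duration_ms(group) -> int:
    """Duration of a trace as seen from a given set of spans."""
    return max(s[3] for s in group) - min(s[2] for s in group)


def keep(group) -> bool:
    """Latency policy: keep the trace if it took at least THRESHOLD_MS."""
    return trace_duration_ms(group) >= THRESHOLD_MS


by_node = defaultdict(list)
for span in spans:
    by_node[span[1]].append(span)

agent_decisions = {node: keep(group) for node, group in by_node.items()}
gateway_decision = keep(spans)

print(agent_decisions)    # {'node-a': False, 'node-b': False, 'node-c': False}
print(gateway_decision)   # True
```

Every agent would drop its fragment of a trace that the policy was written to keep.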

The gateway is the right place because it receives OTLP from all agents and reconstructs full traces before evaluating any policy. The tail_sampling processor holds spans in memory for a configurable decision wait time, then applies your policies against the assembled trace. Below is a ConfigMap that wires this up end to end — a latency-based policy at 500ms, aggressive batching before the remote write, and the file storage extension that keeps you from losing a queue full of spans when the gateway pod restarts:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otelcol-gateway-config
  namespace: observability
data:
  config.yaml: |
    extensions:
      # file_storage keeps the persistent queue on disk across pod restarts.
      # Mount a PVC at this path; an emptyDir is deleted when the pod is replaced.
      file_storage:
        directory: /var/otelcol/queue
        timeout: 10s
        compaction:
          on_start: true
          directory: /var/otelcol/queue/compaction
          max_transaction_size: 65536

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      # tail_sampling MUST come before batch. The processor needs to see
      # individual spans to reconstruct traces; batching first breaks that.
      tail_sampling:
        decision_wait: 30s          # hold spans this long before deciding
        num_traces: 50000           # max traces in memory — tune to your heap
        expected_new_traces_per_sec: 200
        policies:
          - name: keep-slow-traces
            type: latency
            latency:
              threshold_ms: 500     # keep any trace with duration >= 500ms
          - name: keep-errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: probabilistic-baseline
            # Keep 10% of everything else so you have a baseline for fast paths
            type: probabilistic
            probabilistic:
              sampling_percentage: 10

      batch:
        send_batch_size: 8192
        timeout: 10s               # flush even if batch isn't full after 10s
        send_batch_max_size: 16384

      memory_limiter:
        check_interval: 1s
        limit_mib: 1500
        spike_limit_mib: 400

    exporters:
      # Traces → Tempo
      otlp/tempo:
        endpoint: tempo.observability.svc.cluster.local:4317
        tls:
          insecure: true
        sending_queue:
          enabled: true
          num_consumers: 4
          queue_size: 5000
          storage: file_storage    # references the extension above
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 300s

      # Metrics → Prometheus remote_write (Mimir or Grafana Cloud)
      prometheusremotewrite:
        endpoint: ${MIMIR_REMOTE_WRITE_URL}   # injected from Secret
        tls:
          insecure_skip_verify: false
        headers:
          Authorization: "Basic ${MIMIR_AUTH_HEADER}"   # injected from Secret
        sending_queue:
          enabled: true
          queue_size: 5000
          storage: file_storage
        retry_on_failure:
          enabled: true
          max_elapsed_time: 300s

    service:
      extensions: [file_storage]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, tail_sampling, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]

The file_storage extension is the piece most tutorials skip. Without it, the sending_queue lives entirely in memory, and any gateway restart, including a normal rolling update, drops whatever was queued. With it, the queue is persisted to disk and resumes after the pod comes back. The critical detail: you must mount a PVC at /var/otelcol/queue, not an emptyDir. An emptyDir survives a container crash, but it is tied to the pod's lifetime: when the pod is deleted or replaced, which is exactly what a rolling update does, the directory goes with it. That is the failure mode you're trying to prevent. A ReadWriteOnce PVC on any standard StorageClass handles this; you don't need anything exotic.

The credentials problem trips up most first deploys. The ${MIMIR_REMOTE_WRITE_URL} and ${MIMIR_AUTH_HEADER} placeholders above are not Helm template syntax — the OpenTelemetry Collector binary expands environment variable references in its config file at startup. That means you can inject secrets via envFrom without touching the ConfigMap at all:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otelcol-gateway
  namespace: observability
spec:
  template:
    spec:
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector-contrib:0.102.0
          args: ["--config=/conf/config.yaml"]
          envFrom:
            # All keys in this Secret become env vars automatically.
            # Add MIMIR_REMOTE_WRITE_URL and MIMIR_AUTH_HEADER here.
            - secretRef:
                name: otelcol-gateway-secrets
          volumeMounts:
            - name: config
              mountPath: /conf
            - name: queue-storage
              mountPath: /var/otelcol/queue   # must match file_storage.directory
      volumes:
        - name: config
          configMap:
            name: otelcol-gateway-config
        - name: queue-storage
          persistentVolumeClaim:
            claimName: otelcol-gateway-queue
---
apiVersion: v1
kind: Secret
metadata:
  name: otelcol-gateway-secrets
  namespace: observability
type: Opaque
stringData:
  MIMIR_REMOTE_WRITE_URL: "https://mimir.example.com/api/v1/push"
  MIMIR_AUTH_HEADER: "dXNlcjpwYXNzd29yZA=="   # base64(user:password)
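
The MIMIR_AUTH_HEADER value above is plain base64 of user:password. If you'd rather generate it than trust an online encoder, a minimal TypeScript sketch looks like this (basicAuthValue is a hypothetical helper name, not part of any collector tooling):

```typescript
// Build the value that follows "Basic " in the Authorization header.
// basicAuthValue is an illustrative helper, not part of the collector.
function basicAuthValue(user: string, password: string): string {
  return Buffer.from(`${user}:${password}`, "utf8").toString("base64");
}

console.log(basicAuthValue("user", "password")); // dXNlcjpwYXNzd29yZA==
```

Because the Secret uses stringData, Kubernetes stores this string as-is and handles its own encoding of the Secret separately.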

A few sharp edges worth flagging before you call this production-ready. First, tail_sampling and batch ordering matters more than the docs make clear — batch after tail sampling, not before, otherwise the batch processor groups spans from different traces together and the tail sampler can't reconstruct them correctly. Second, the num_traces: 50000 setting directly determines gateway memory pressure; at 30s decision wait, a spike in traffic can fill that buffer fast. Watch the otelcol_processor_tail_sampling_sampling_traces_on_memory metric and set a horizontal pod autoscaler trigger on it, not just CPU. Third, if you're sending to Grafana Cloud rather than self-hosted Mimir, the remote write URL format is slightly different — it includes /api/prom/push not /api/v1/push, and the auth is HTTP Basic against your Grafana Cloud stack credentials, not a bearer token.
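
To put numbers on the num_traces warning, here is a rough sizing sketch in TypeScript. It assumes a steady arrival rate and that every trace is held for the full decision_wait; both helper names are made up for illustration, and real traffic is burstier than this model.

```typescript
// Rough sizing for tail_sampling: traces held = arrival rate x decision_wait.
// Both helpers are illustrative and model steady-state arrival only.
function tracesInFlight(newTracesPerSec: number, decisionWaitSec: number): number {
  return newTracesPerSec * decisionWaitSec;
}

function rateThatFillsBuffer(numTraces: number, decisionWaitSec: number): number {
  return numTraces / decisionWaitSec;
}

// Values from the gateway config: 200 traces/s expected, 30s wait, 50000 slots
console.log(tracesInFlight(200, 30));                    // 6000
console.log(Math.floor(rateThatFillsBuffer(50000, 30))); // 1666
```

Under these assumptions the configured buffer has roughly 8x headroom over the expected rate, and a sustained spike above about 1,666 new traces per second is what fills it.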

Three Non-Obvious Behaviors That Will Burn You

The one that stings most operators first: the memory_limiter processor must be the first entry in every pipeline's processor list, not second or third. The instinct is to put batch first because batching "should happen before limiting," but that logic is backwards under load. When batch runs first, it accumulates spans and metrics into memory until the batch is full — then hands a large, already-allocated blob to memory_limiter, which can only reject it after the damage is done. The correct order is processors: [memory_limiter, batch] in every pipeline, every time, no exceptions. The collector docs mention this, but it's buried, and the default example configs don't always model it correctly.

# Correct processor order — memory_limiter gates before batch accumulates
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # NOT [batch, memory_limiter]
      exporters: [otlp/gateway]
    metrics:
      receivers: [kubeletstats, prometheus]
      processors: [memory_limiter, batch]
      exporters: [otlp/gateway]

The kubeletstats receiver's TLS error is a classic 30-minute timesink. When you see x509: certificate signed by unknown authority in the collector pod logs, the instinct is to check RBAC: service account permissions, ClusterRole bindings, whether the kubelet endpoint is accessible at all. Those are all fine. The error is purely TLS: the kubelet serves a self-signed cert that the collector can't verify because it doesn't have the cluster CA. Fix it one of two ways: set insecure_skip_verify: true directly on the receiver (acceptable inside a private cluster network, not ideal), or mount the cluster CA and point the receiver's ca_file at it. Both are top-level receiver settings, not nested under a tls block. The permissions angle is a red herring that the error message does nothing to dispel.

receivers:
  kubeletstats:
    collection_interval: 20s
    auth_type: serviceAccount
    endpoint: "https://${env:K8S_NODE_NAME}:10250"
    insecure_skip_verify: true   # or: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    metric_groups: [node, pod, container]

Persistent queues backed by file_storage will quietly fill a node's disk if you're not watching them. The tail_sampling processor itself buffers traces in memory for the decision_wait window, bounded by num_traces, so a generous window (say, 30s) combined with a traffic spike shows up as memory pressure first. The disk growth comes from the exporter's sending_queue: when the backend is slow or down, every batch that survives sampling is written to the file_storage directory until queue_size is reached. The extension has no byte-level size cap of its own, and the underlying database file does not shrink after the queue drains unless compaction is enabled. Bound the queue with queue_size, enable compaction, and wire up an alert on the filesystem path, not just collector-level metrics. When the disk fills, the collector doesn't degrade gracefully; it fails writes, and you get silent trace loss without obvious errors at the application level.

extensions:
  file_storage:
    directory: /var/otelcol/queue
    compaction:
      on_rebound: true   # reclaim disk space after the queue drains
      directory: /var/otelcol/queue/compaction

processors:
  tail_sampling:
    decision_wait: 15s   # shorter window = fewer traces held in memory during spikes
    num_traces: 50000    # hard cap on traces buffered while waiting for a decision

gRPC keepalive settings between agent and gateway are ignored until they suddenly matter. kube-proxy silently drops idle TCP connections after roughly 15 minutes — the exact timeout varies by cloud provider and kernel settings, but the behavior is consistent. Without keepalive_time and keepalive_timeout set on the agent's OTLP exporter, the connection goes idle during a quiet period, gets dropped at the proxy layer, and the next data push fails with a reconnect storm that produces errors resembling a gateway crash. You'll see connection refused or stream reset errors in the agent logs, gateway pod restarts won't help, and the root cause won't be obvious until you correlate the timing with kube-proxy idle timeouts. Set these explicitly on every agent exporter pointing at the gateway:

exporters:
  otlp/gateway:
    endpoint: "otel-gateway-collector:4317"
    tls:
      insecure: true   # internal cluster traffic; TLS termination handled at ingress
    keepalive:
      time: 10s          # send keepalive ping after 10s of inactivity
      timeout: 5s        # wait 5s for pong before declaring connection dead
      permit_without_stream: true   # keep the connection alive even with no active RPCs

Validating the Pipeline and What to Monitor About the Collector Itself

The most common mistake when deploying the agent + gateway pattern is trusting the pipeline because pods are running and no errors appear in logs. That's not validation; that's hope. The OpenTelemetry Collector exposes its own Prometheus metrics on :8888/metrics by default, and before you declare the pipeline production-ready, four counters per signal need to be on a dashboard. For traces: otelcol_receiver_accepted_spans, otelcol_receiver_refused_spans, otelcol_exporter_sent_spans, and otelcol_exporter_send_failed_spans. For metrics, use the same four with the _metric_points suffix (otelcol_exporter_sent_metric_points, otelcol_exporter_send_failed_metric_points, and so on). If accepted spans are climbing but sent spans are flat, your pipeline has a disconnect, likely a processor misconfiguration or a backend that's silently rejecting data. Together these counters give you the full picture: data arriving, data leaving, and data failing at both ends.

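# Quick scrape to verify the metrics endpoint is live on an agent pod.
# This needs wget inside the container; the stock collector images are minimal
# and may not ship it, in which case port-forward 8888 from an agent pod and
# curl it from your workstation instead.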
kubectl exec -n observability ds/otel-agent -- \
  wget -qO- http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans

# Expected output looks like:
# otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 4821
# If this is zero after your app has been running, check receiver config first
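
Eyeballing grep output gets old quickly. As a rough sketch, the comparison can be scripted in TypeScript by summing a counter across its label sets in the scraped text. The sumMetric helper and the sample numbers below are invented for illustration, and the regex assumes label values contain no closing brace.

```typescript
// Sum every sample of one metric in Prometheus text exposition format.
// sumMetric is an illustrative helper; it assumes no "}" inside label values.
function sumMetric(exposition: string, metric: string): number {
  let total = 0;
  for (const line of exposition.split("\n")) {
    if (line.startsWith("#")) continue; // HELP / TYPE comments
    // Match "name{labels} value" or "name value"
    const m = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/);
    if (m && m[1] === metric) total += Number(m[3]);
  }
  return total;
}

// Made-up scrape output for the demo
const sample = [
  "# TYPE otelcol_receiver_accepted_spans counter",
  'otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 4821',
  'otelcol_receiver_accepted_spans{receiver="otlp",transport="http"} 179',
  'otelcol_exporter_sent_spans{exporter="otlp/gateway"} 4990',
].join("\n");

const accepted = sumMetric(sample, "otelcol_receiver_accepted_spans");
const sent = sumMetric(sample, "otelcol_exporter_sent_spans");
console.log(accepted, sent, accepted - sent); // 5000 4990 10
```

A small gap on an agent is normal, since some spans are still sitting in a batch or queue at scrape time. A gap that keeps growing between scrapes is the signal worth chasing.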

Before you push any agent config to production nodes, otelcol validate is the correct first gate. It parses the full pipeline graph, checks that every referenced component is enabled, and catches type mismatches in processor config that would only surface at runtime. Pair that with a synthetic trace using otel-cli — a standalone binary that sends a real OTLP span — and you can confirm end-to-end flow from a single terminal session without touching your application at all.

# Config validation — runs in CI, costs nothing
otelcol validate --config=agent-config.yaml

# Synthetic trace to a locally-running collector (replace endpoint as needed)
otel-cli exec \
  --endpoint http://localhost:4317 \
  --name "smoke-test" \
  --service "validation-check" \
  -- echo "pipeline alive"

# Then immediately check the counter incremented
kubectl exec -n observability ds/otel-agent -- \
  wget -qO- http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans

On alerting thresholds: the metric that most operators misconfigure alerts on is otelcol_exporter_queue_size divided by otelcol_exporter_queue_capacity — queue utilization. When that ratio climbs above 70%, the instinct is to blame noisy instrumentation or a cardinality explosion on the app side. Usually it's the opposite: your backend (Tempo, Jaeger, a remote OTLP endpoint) is slow to acknowledge, so the exporter's send goroutines are backing up. The queue fills because of push-side latency, not because your app suddenly tripled its span rate. Set a separate alert on otelcol_receiver_accepted_spans rate for the noisy-app scenario — those are genuinely independent failure modes and conflating them wastes investigation time.

# Prometheus alerting rule that distinguishes queue pressure from volume spikes
groups:
  - name: otelcol-health
    rules:
      - alert: CollectorExporterQueueHigh
        expr: |
          otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.70
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Exporter queue above 70% — check backend latency, not app cardinality"

      - alert: CollectorExporterDropping
        expr: |
          rate(otelcol_exporter_send_failed_spans[5m]) > 0
            or rate(otelcol_exporter_send_failed_metric_points[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Collector is actively dropping data — backend unreachable or rejecting"

The operator mindset that makes this pipeline trustworthy — local validation before rollout, explicit metric confirmation, distinguishing which side of the pipeline is misbehaving — maps directly to how reliable local-first tooling gets built in general. If you're wiring AI-assisted tooling into your observability stack or CI pipelines, the same discipline applies: validate locally, instrument the tool itself, and don't trust that something is working just because it didn't crash. The AI Coding Tools in 2026: Cloud Copilots vs Local Models guide covers how that operator mindset translates to choosing and running dev tools — worth reading alongside this if you're building automation that touches your infra.

When to Use This Pattern and When to Skip It

The most common mistake with OpenTelemetry in Kubernetes is adding the gateway tier reflexively — because the diagram looks like the "right" architecture. The gateway is a real operational burden: it's another Deployment to resource-tune, another config to version, another thing to be down when you're debugging at midnight. Add it only when something specific forces your hand.

Add the gateway tier when you cross roughly eight nodes. Below that, the per-node agent DaemonSet pods can push directly to your backend without meaningful fan-out problems. But past that threshold, a few concrete pressures appear simultaneously: tail sampling requires seeing all spans for a given trace in one place, which a DaemonSet pod physically cannot do since different services may run on different nodes. If your backend — Grafana Cloud, Honeycomb, a managed Tempo instance — enforces rate limits or requires a per-tenant API key, you don't want that credential baked into every DaemonSet pod config and replicated across 20 nodes. And when you eventually swap from Jaeger to Tempo, or add a second backend for compliance, doing that in one gateway config file beats rolling a DaemonSet update across the entire fleet.

Skip the gateway entirely when your cluster is three to five nodes and your observability backend lives inside the cluster — say, Victoria Metrics and Grafana running as in-cluster Deployments. Head-based sampling is good enough for a small cluster where you're not drowning in span volume, and an internal backend means there's no auth boundary or rate-limit problem to solve centrally. In that topology, a gateway Deployment just sits between the agent and the backend doing nothing useful while consuming memory you'd rather give to workloads. The agent pods can push OTLP directly to an in-cluster Tempo or Loki endpoint, and the whole config stays in one DaemonSet manifest.

The hybrid case is the one the official docs gloss over but shows up constantly in real clusters: collect node-level metrics (CPU, memory, disk via the hostmetrics receiver) on the agent and ship them straight to the metrics backend, but route traces exclusively through the gateway. The reasoning is resource efficiency. Node metrics are high-cardinality and high-volume; pushing them through the gateway means it does expensive buffering and compression on a stream that doesn't benefit from centralization, because there's no "tail sampling for metrics" concept. Traces, on the other hand, need the gateway's tail_sampling processor to see all spans of a trace, across nodes, before sampling decisions get made. A practical split looks like this in the agent config:

exporters:
  # Metrics go direct to Victoria Metrics — no gateway hop
  prometheusremotewrite:
    endpoint: "http://victoria-metrics.monitoring.svc:8428/api/v1/write"

  # Traces go to the gateway for tail sampling
  otlp/gateway:
    endpoint: "otelcol-gateway.monitoring.svc:4317"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [hostmetrics, kubeletstats]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/gateway]

This keeps the gateway focused on tail sampling, where it earns its keep, and lets the metrics pipeline take the short path. The tradeoff is that you're maintaining two separate pipeline configs and two separate backends, so don't reach for this unless you're actually seeing the gateway become a bottleneck on metric throughput. That typically shows up as the gateway pod hitting its memory limit on clusters with dense node-metrics scrape intervals under 15 seconds.

Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.

Code Mode + MCP: Wiring Your Local LLM Into a Real Development Workflow

우병수 — Mon, 29 Jun 2026 08:11:56 +0000

TL;DR: The hallucination that wastes the most time isn't a wrong algorithm — it's a wrong tool call. Your LLM confidently writes `prisma db pull --schema=.

📖 Reading time: ~24 min

What's in this article

The Problem: Your AI Coding Assistant Lives in a Walled Garden
How MCP Fits Into a Code Mode Agent Loop
Setting Up an MCP Server for Your Dev Environment
Routing Between Local Ollama Models and Cloud APIs Inside Code Mode
Wiring MCP Into an Automated Dev Pipeline (n8n + Webhooks)
Three Non-Obvious Behaviors That Will Cost You Time
When This Setup Is Not the Right Call

The Problem: Your AI Coding Assistant Lives in a Walled Garden

The hallucination that wastes the most time isn't a wrong algorithm — it's a wrong tool call. Your LLM confidently writes prisma db pull --schema=./prisma/schema.prisma against a database that's actually on a non-standard port behind a tunnel, or generates a Docker Compose service block referencing an image tag that hasn't existed since you migrated registries. The model knows your code because you pasted it. It has no idea what's actually running.

This is the structural gap: the model's context window is a snapshot, not a live connection. You copy in a schema file, a config block, maybe some logs — and the moment any of those change, the model is reasoning against stale state. It can't ask your Postgres instance what columns actually exist. It can't check whether your localhost:3000 health endpoint is up. It can't read the compiled output in /dist to verify its own edits worked. So it fills the gaps with confident guesses, and those guesses are wrong in proportion to how far your actual stack drifts from "typical project".

Model Context Protocol fixes this at the architecture level rather than the workflow level. Instead of you manually copying context into a chat window, MCP defines a typed tool-calling surface — the model can call read_file, run_command, query_database, or any custom tool you expose, and get structured responses back. The key word is typed: tool inputs and outputs are defined schemas, so the model knows exactly what arguments are valid and what shape the response will have. That eliminates an entire class of hallucinated API calls where the model invents parameters that don't exist in your actual SDK.

Code Mode in tools like Cursor, Cline, and Continue tightens this further because they run an agentic loop — plan a change, write the edit, verify it compiled or the test passed, then loop. That verify step is where everything falls apart without reliable tool access. If the model can't actually run tsc --noEmit and read stderr, it either skips verification entirely (optimistically assuming success) or asks you to run it and paste back the output, which defeats the loop. MCP gives that verify step a real execution surface. The plan→edit→verify cycle becomes autonomous instead of human-relayed. For a broader map of where this fits, see our guide on AI Coding Tools in 2026: Cloud Copilots vs Local Models.

The practical consequence is that walled-garden assistants optimize for demo quality, not operational accuracy. They look impressive on greenfield TypeScript with a clean schema. They fall apart on the actual project: the monorepo with four package managers, the legacy MySQL 5.7 table with implicit nulls everywhere, the build pipeline that requires three env vars that aren't in any file the model has seen. MCP doesn't make the model smarter — it makes the model's knowledge current, specific, and verifiable against the system that's actually running.

How MCP Fits Into a Code Mode Agent Loop

The surprising thing about MCP isn't the protocol itself — it's how little the model actually needs to understand about it. The model sees a flat list of typed tool signatures, picks one, emits a JSON call, and gets a result back. That's it. The JSON-RPC plumbing — server discovery, capability negotiation, result framing — happens entirely in the client layer. Concretely: your editor plugin (the MCP client) maintains a list of running MCP servers; each server advertises its tools as typed schemas; when the model decides to call read_file or run_tests, it's doing the same thing it does when calling any function-calling API. The model never speaks directly to the MCP server. The client brokers everything.

Code Mode is where MCP actually earns its keep, and the distinction from Chat Mode matters operationally. Chat Mode is stateless turn-by-turn: the model answers, forgets, moves on. Code Mode maintains a task context — open files, recent edits, tool call history, failure state — and will re-invoke a tool if the previous attempt returned an error or produced unexpected output. That re-invocation behavior is the sharp edge. Your MCP servers must be stateless and idempotent, because the agent will retry them without knowing what side effects the first call left behind. A shell tool that runs npm install twice should produce the same result as running it once. A file-write tool that gets called twice with the same payload shouldn't corrupt the file. If your server holds mutable state between calls, Code Mode will eventually find the broken edge.
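
To make the idempotency requirement concrete, here is a minimal sketch of a file-write handler that is safe to retry. The idempotentWrite function and the write_file tool it would back are hypothetical; the point is only that a second call with the same payload becomes a no-op.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Illustrative handler body for a hypothetical write_file tool.
// A retry with the same payload leaves the file byte-identical and
// reports "unchanged" instead of rewriting it.
function idempotentWrite(filePath: string, content: string): "written" | "unchanged" {
  if (existsSync(filePath) && readFileSync(filePath, "utf8") === content) {
    return "unchanged";
  }
  writeFileSync(filePath, content, "utf8");
  return "written";
}

const target = join(mkdtempSync(join(tmpdir(), "mcp-demo-")), "note.txt");
console.log(idempotentWrite(target, "hello")); // written
console.log(idempotentWrite(target, "hello")); // unchanged
```

Returning a distinct status for the no-op case also gives the agent something useful to read: it can tell that its retry changed nothing.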

Local models are where this architecture develops cracks. Tool-call reliability — meaning: the model emits well-formed JSON that correctly matches the tool's argument schema — degrades noticeably on quantized models below Q5 quantization. A 7B model under load will produce malformed JSON arguments, miss required fields, or hallucinate argument names that don't exist in the schema. This isn't a fringe case; it happens regularly in agentic loops where the context window fills with prior tool results. On my 32GB VRAM workstation, Q4_K_M at 32B is the practical floor for running a reliable Code Mode loop. Below that, you're spending more time writing retry logic and output validators than you're saving by running locally. For anything business-critical, the fallback to a cloud model isn't a failure — it's the right call.
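
The output validators mentioned above don't need to be elaborate. A minimal sketch, assuming a flat argument schema with primitive types only (the ToolSchema shape and validateToolArgs are invented for this example; a real server would more likely lean on Zod or JSON Schema):

```typescript
type FieldType = "string" | "number" | "boolean";
interface ToolSchema {
  required: Record<string, FieldType>;
  optional?: Record<string, FieldType>;
}

const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Returns a list of problems; an empty list means the call is safe to dispatch.
function validateToolArgs(rawArgs: string, schema: ToolSchema): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArgs);
  } catch {
    return ["arguments are not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["arguments must be a JSON object"];
  }
  const args = parsed as Record<string, unknown>;
  const known: Record<string, FieldType> = { ...schema.required, ...(schema.optional ?? {}) };
  const problems: string[] = [];
  for (const name of Object.keys(schema.required)) {
    if (!has(args, name)) problems.push(`missing required field: ${name}`);
  }
  for (const [name, value] of Object.entries(args)) {
    if (!has(known, name)) problems.push(`unknown field: ${name}`);
    else if (typeof value !== known[name]) problems.push(`wrong type for ${name}`);
  }
  return problems;
}

const runCommand: ToolSchema = { required: { command: "string" }, optional: { cwd: "string" } };
console.log(validateToolArgs('{"command":"npm test"}', runCommand)); // []
console.log(validateToolArgs('{"cmd":"npm test"}', runCommand));     // missing + unknown field
console.log(validateToolArgs('{"command": "npm test"', runCommand)); // not valid JSON
```

Feeding the problem list back to the model as the tool result usually gets a corrected call on the next turn, which is cheaper than failing the whole loop.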

Three MCP server types cover the majority of real development workflows:

Filesystem servers — read and write project files, list directory trees, watch for changes. The agent uses these to inspect source, patch files, and verify its own edits. The main gotcha: scope the allowed paths tightly in the server config, or the agent will happily traverse your entire home directory looking for context.
Shell/process servers — run test suites, linters, build commands, and capture stdout/stderr. These are the highest-value tools in a Code Mode loop because they close the feedback cycle: write code, run tests, read failures, patch, repeat. Keep timeouts aggressive — a hanging test suite will stall the entire agent loop.
Service connectors — Postgres queries, REST API calls, git operations. Useful when the task involves understanding live schema state, checking API responses, or inspecting commit history. These are also the most dangerous from an idempotency standpoint: a tool that fires a POST request on retry can create duplicate records. Wrap mutating calls in dry-run flags or explicit confirmation steps.

Setting Up an MCP Server for Your Dev Environment

The TypeScript SDK wins on completeness — it ships with typed tool schemas, Zod validation helpers, and the full transport abstraction layer out of the box. The Python SDK is fine for pure data tools, but if your MCP server shells out frequently (running lint, parsing files, invoking CLI tools), the startup overhead on each subprocess call compounds fast. For a dev environment server that needs to feel snappy inside Cursor or Cline, start with @modelcontextprotocol/sdk on Node 20+.

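Here's a minimal but real server scaffold (tool registration, typed input schema, stdio transport) in under 50 lines:

```typescript
// mcp-dev-server/src/index.ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { execFileSync } from "node:child_process";
import path from "node:path";

const server = new McpServer({
  name: "dev-tools",
  version: "0.1.0",
});

// Absolute root the tool may touch; every incoming path is resolved against it
const allowedRoot = path.resolve(process.env.WORKSPACE_ROOT ?? "/workspace/src");

// Register a tool: the schema is validated before your handler runs
server.tool(
  "run_eslint",
  {
    filepath: z.string().describe("Relative path inside workspace"),
  },
  async ({ filepath }) => {
    // Don't trust the model to pass safe paths: resolve first, then check containment.
    // A plain startsWith on the raw string misses "../" traversal and relative inputs.
    const target = path.resolve(allowedRoot, filepath);
    if (target !== allowedRoot && !target.startsWith(allowedRoot + path.sep)) {
      return { content: [{ type: "text", text: "Path outside allowed root" }], isError: true };
    }
    try {
      // Arguments go in as an array, so the path never passes through a shell
      const result = execFileSync("npx", ["eslint", "--format", "compact", target], {
        encoding: "utf8",
        timeout: 15_000,
      });
      return { content: [{ type: "text", text: result || "No lint problems found" }] };
    } catch (err) {
      // ESLint exits non-zero when it finds problems; the report is still on stdout
      const e = err as { stdout?: string; message: string };
      return { content: [{ type: "text", text: e.stdout || e.message }] };
    }
  }
);

const transport = new StdioServerTransport();
await server.connect(transport);
// process stays alive; the MCP host controls the lifecycle via stdin/stdout
```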

The mcp.json config is what your editor actually reads to know which servers exist and how to spawn them. Cursor looks in .cursor/mcp.json at the project root; VS Code uses .vscode/mcp.json. Each entry has the same shape either way (command, args, and optional env), though VS Code nests the entries under a top-level servers key rather than mcpServers. Here's a realistic two-server config for .cursor/mcp.json with a filesystem tool jailed to two subdirectories and a Postgres query tool. JSON has no comment syntax, so the notes live here rather than in the file: the reference filesystem server takes its allowed directories as positional arguments (./src and ./tests below, not the repo root), and the reference Postgres server takes the connection string as a positional argument.

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "./src", "./tests"]
    },
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://readonly_user:pass@localhost:5432/mydb"
      ]
    }
  }
}
```

The allowed-directory list on the filesystem server is the single most important security control, and the easiest one to get wrong. Point it at the repo root or your home directory and a model operating in code mode can and will read .env files, private keys, or node_modules/.cache, because anything under an allowed path is reachable. Jail it to ./src and ./tests only, and use a read-only Postgres role for the DB server, not your migration user. These aren't theoretical risks; they're the first two things to audit before you let any agentic flow run unsupervised.

The stdio vs SSE transport distinction will bite you the first time you wonder why your server loses state mid-session. Stdio (the default in Cursor and Cline) spawns your server as a subprocess and communicates over stdin/stdout — the process is killed when the session closes, which means any in-memory state, open DB connections, or file watchers disappear with it. SSE keeps a long-running HTTP server and multiplexes sessions over it, which is what you want if your server does expensive initialization (loading an embedding model, opening a connection pool). The cost is operational: you need a running process, a port, and CORS headers set to allow your editor's origin. For most local dev setups, stdio is fine and simpler. Switch to SSE only when you're sharing one server instance across multiple clients or need persistent state between tool calls.

Routing Between Local Ollama Models and Cloud APIs Inside Code Mode

The routing decision that actually matters isn't "local vs. cloud" as a preference — it's token estimate as a proxy for task complexity. A 40-token autocomplete request has no business hitting Claude Sonnet over an API call with 200ms round-trip latency. But a multi-step agent task that needs to read three files, reason about dependencies, and emit a patch? That's where a 7B model starts hallucinating import paths. The split I run is a simple token-count threshold with a task-type override:

```typescript
// route.ts: called before every model dispatch
const FAST_MODEL = "qwen2.5-coder:7b"; // Ollama local, ~5GB VRAM
const STRONG_LOCAL = "qwen2.5-coder:32b-q4_k_m"; // ~22GB VRAM, used for agents
const CLOUD_FALLBACK = "claude-sonnet-4-5"; // API, for when local is saturated

type TaskType = "autocomplete" | "inline_edit" | "agent" | "multi_file";
```
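
The listing stops at the type definition, so here is one way the routing function itself might look. Treat it as a sketch under stated assumptions: the 512-token cut-off is invented for the example, and the rule that agent and multi-file tasks always get the stronger model is my reading of "task-type override". The signature matches the routeModel(estimatedTokens, taskType, healthy) call used later in the dispatch loop.

```typescript
const FAST_MODEL = "qwen2.5-coder:7b";
const STRONG_LOCAL = "qwen2.5-coder:32b-q4_k_m";
const CLOUD_FALLBACK = "claude-sonnet-4-5";

type TaskType = "autocomplete" | "inline_edit" | "agent" | "multi_file";

// Assumed cut-off, not a value from the article: tune it against your own latency data.
const FAST_TOKEN_LIMIT = 512;

function routeModel(estimatedTokens: number, taskType: TaskType, ollamaHealthy: boolean): string {
  // Local models are unusable when Ollama is down or saturated
  if (!ollamaHealthy) return CLOUD_FALLBACK;
  // Task-type override: agent and multi-file work always gets the stronger model
  if (taskType === "agent" || taskType === "multi_file") return STRONG_LOCAL;
  // Otherwise the token estimate stands in for task complexity
  return estimatedTokens <= FAST_TOKEN_LIMIT ? FAST_MODEL : STRONG_LOCAL;
}

console.log(routeModel(40, "autocomplete", true));  // qwen2.5-coder:7b
console.log(routeModel(40, "agent", true));         // qwen2.5-coder:32b-q4_k_m
console.log(routeModel(3000, "inline_edit", true)); // qwen2.5-coder:32b-q4_k_m
console.log(routeModel(40, "autocomplete", false)); // claude-sonnet-4-5
```

Checking health first keeps the function total: every input maps to a model that can actually answer, so the caller never has to handle a "no route" case.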

Wiring Cline or Continue to Ollama's OpenAI-compatible endpoint sounds trivial until it isn't. The config looks like this in Continue's config.json:

```json
{
  "models": [
    {
      "title": "Qwen2.5-Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434/v1"
    }
  ]
}
```

Two things burn people here. First: the model name must be the exact tag string from ollama list — qwen2.5-coder:7b not qwen2.5-coder or qwen2.5-coder:latest. Ollama will 404 silently or return a generic error that Continue surfaces as "model not found" with no useful hint. Second, and this one is subtle: a trailing slash on apiBase breaks tool-call parsing in Cline's MCP dispatch layer. http://localhost:11434/v1/ causes malformed endpoint concatenation downstream — the tool-call response comes back as plain text instead of JSON, and the MCP server never receives the structured call. No trailing slash, ever.

The VRAM arithmetic on a 32GB card is tight but workable if you're disciplined. The 32B Q4_K_M model loads to roughly 22GB. An active filesystem MCP server process plus a live Postgres MCP connection adds maybe 1-2GB of system VRAM overhead depending on your driver and how much the MCP servers cache. That leaves you 8-9GB of headroom — comfortable for a single agent session where the KV cache stays bounded. The problem is the second agent session. Once you're over 30GB allocation, the driver starts paging KV cache to system RAM, and latency on generation steps jumps 4-8x on typical query lengths. You'll feel it as stalls mid-stream rather than a clean slow response. The fix is to serialize agent sessions at the dispatch layer, not the model layer — queue the second request until the first completes rather than letting both run concurrently.
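
Serializing at the dispatch layer can be as small as a promise chain. A minimal sketch follows; SessionQueue is an illustrative name rather than anything Ollama or an MCP client provides, and the timers stand in for real agent sessions.

```typescript
// Minimal FIFO gate: each task starts only after the previous one settles.
class SessionQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(() => task());
    // Swallow errors on the chain itself so one failed session doesn't block the next
    this.tail = result.catch(() => undefined);
    return result;
  }
}

const queue = new SessionQueue();
const order: string[] = [];
const session = (name: string, ms: number) => async (): Promise<void> => {
  order.push(`${name}:start`);
  await new Promise((resolve) => setTimeout(resolve, ms));
  order.push(`${name}:end`);
};

// The second session is submitted immediately but waits for the first to finish
const done = Promise.all([queue.run(session("a", 50)), queue.run(session("b", 10))]);
done.then(() => console.log(order.join(" "))); // a:start a:end b:start b:end
```

The caller still gets back a promise per session, so nothing upstream needs to know it was queued. The only visible effect is that the second session starts later.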

The health-check-before-dispatch pattern is what keeps the TypeScript automation engine from failing visibly when Ollama is saturated or mid-reload. Ollama's /api/tags endpoint responds in under 5ms when the service is up and becomes immediately unreachable when it's not — it's a better liveness signal than /api/generate with a dummy prompt:

```typescript
// health.ts
async function isOllamaHealthy(timeoutMs = 800): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch("http://localhost:11434/api/tags", {
      signal: controller.signal,
    });
    return res.ok;
  } catch {
    // connection refused OR timeout: treat both as unhealthy
    return false;
  } finally {
    clearTimeout(timer); // always release the timer, including on failure
  }
}

// In the dispatch loop:
const healthy = await isOllamaHealthy();
const model = routeModel(estimatedTokens, taskType, healthy);
// If model === CLOUD_FALLBACK, swap the client to OpenAI SDK: same interface, different baseURL
```

The 800ms timeout is deliberate. Anything longer and a slow-but-alive Ollama instance under load will pass the health check but then time out on the actual generation request — you've just added latency without gaining reliability. If it can't answer /api/tags in under a second, treat it as unavailable and route to the cloud client. The OpenAI SDK and the Ollama OpenAI-compatible endpoint share enough interface surface that swapping the baseURL and adding an API key is the only code change needed — keep both clients instantiated and select between them after the health check resolves.

Wiring MCP Into an Automated Dev Pipeline (n8n + Webhooks)

The most disruptive part of this whole setup isn't the agent — it's realizing you can close the loop between your repo and your local reasoning layer without touching a single SaaS CI product. A GitHub webhook fires, your local n8n instance picks it up, an MCP-capable agent reads the diff, runs your actual test suite, and posts a structured review back to the PR. The whole chain runs on your workstation. Here's how to wire it.

GitHub Webhook → n8n → MCP Agent → PR Comment

The entry point is a Webhook node in n8n listening on something like /webhook/github-pr. You authenticate it with a shared secret checked against the X-Hub-Signature-256 header — n8n's Webhook node exposes the raw body, so you can verify it in a Function node before anything else runs. After verification, extract pull_request.number, pull_request.diff_url, and repository.full_name from the payload. A second HTTP Request node fetches the raw diff from diff_url using your GitHub token. That diff string goes into the body of the call to your local MCP agent endpoint.
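The signature check described above can be sketched with Node's built-in `crypto` module. This is a sketch under assumptions: the function names are illustrative, and it expects the raw request body as a string exactly as received (verifying a re-serialized JSON object will not match).

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the value GitHub sends in X-Hub-Signature-256 for a given raw body.
function signGithubBody(secret: string, rawBody: string): string {
  return "sha256=" + createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Returns true only when the header matches the HMAC-SHA256 of the raw body.
function verifyGithubSignature(
  secret: string,
  rawBody: string,
  signatureHeader: string | undefined
): boolean {
  if (!signatureHeader) return false;
  const expected = Buffer.from(signGithubBody(secret, rawBody), "utf8");
  const received = Buffer.from(signatureHeader, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

The constant-time comparison matters here: a plain `===` on the hex strings can leak how many leading characters matched.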

The HTTP Request node hitting your MCP-capable agent endpoint should look like this:

```json
{
  "model": "qwen2.5-coder:32b",
  "messages": [
    {
      "role": "user",
      "content": "Review this diff for correctness, style violations, and test coverage gaps. Diff:\n\n{{$json.diff}}"
    }
  ],
  "tool_choice": "auto",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "mcp__shell__run_command",
        "description": "Run a shell command on the local machine via MCP shell tool",
        "parameters": {
          "type": "object",
          "properties": {
            "command": { "type": "string" },
            "cwd": { "type": "string" }
          },
          "required": ["command"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "mcp__filesystem__read_file",
        "description": "Read a file from the registered workspace root",
        "parameters": {
          "type": "object",
          "properties": {
            "path": { "type": "string" }
          },
          "required": ["path"]
        }
      }
    }
  ]
}
```

The tool names use the mcp__<server>__<tool> namespacing convention. When the agent decides to run tests, it will return a tool call for mcp__shell__run_command with something like "command": "cd /workspace/myrepo && npm test -- --reporter=json". Your n8n workflow catches that tool call response, executes the dispatch back to the MCP server's tool endpoint, feeds the output back into a follow-up messages array, and loops until the agent returns a plain text completion — which then gets POST'd to the GitHub PR comments API as a structured review.
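Two pieces of that loop are easy to get subtly wrong: splitting the namespaced tool name, and building the follow-up messages array. The sketch below assumes OpenAI-style chat messages (an assistant message carrying `tool_calls`, answered by a `tool` message with a matching `tool_call_id`); the type and function names are illustrative.

```typescript
interface ToolCall {
  id: string;
  // In OpenAI-style responses, arguments arrive as a JSON-encoded string.
  function: { name: string; arguments: string };
}

type ChatMessage =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: string | null; tool_calls?: ToolCall[] }
  | { role: "tool"; tool_call_id: string; content: string };

// Split "mcp__shell__run_command" into { server: "shell", tool: "run_command" }.
function parseMcpToolName(fullName: string): { server: string; tool: string } | null {
  const [prefix, server, ...rest] = fullName.split("__");
  const tool = rest.join("__");
  if (prefix !== "mcp" || !server || !tool) return null;
  return { server, tool };
}

// Append the assistant's tool call and the tool output, producing the next request's messages.
function appendToolResult(messages: ChatMessage[], call: ToolCall, output: string): ChatMessage[] {
  return [
    ...messages,
    { role: "assistant", content: null, tool_calls: [call] },
    { role: "tool", tool_call_id: call.id, content: output },
  ];
}
```

The loop then repeats: send the extended array back to the model, and stop once the response contains plain content and no tool calls.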

The Version Mismatch Failure Mode

The most reliable way to break this pipeline silently is to run two different versions of your MCP server — one that your editor (Cursor, VS Code with Cline, whatever) registered tools against, and a different one that n8n is hitting at runtime. The agent returns a tool call like mcp__shell__run_command, your MCP server doesn't recognize it because the tool was renamed or its parameter schema changed in a patch release, and you get a generic 400 or an unhandled tool error that n8n logs as a completed node. The PR comment never gets posted. You don't notice until someone asks why the bot went quiet.

Fix this in two places. First, pin the MCP server package version in your Dockerfile or package.json — no floating ^ or ~:

```json
// package.json: MCP server process
{
  "dependencies": {
    "@modelcontextprotocol/server-filesystem": "0.6.2",
    "@modelcontextprotocol/server-shell": "0.4.1"
  }
}
```

Second, add a /info route to your MCP server wrapper that returns the pinned version. Then have your n8n workflow hit /info as its first node and assert the version matches what it expects before doing anything else. A mismatch kills the workflow run immediately with a descriptive error rather than a ghost failure downstream:

```typescript
// Express wrapper around your MCP server
app.get('/info', (req, res) => {
  res.json({
    // reads the version from package.json; npm only sets this variable
    // when the process is started through an npm script
    mcp_server_version: process.env.npm_package_version,
    tools: registeredTools.map(t => t.name),
    started_at: startTime.toISOString()
  });
});
```
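On the n8n side, the assertion against `/info` can be sketched as a small check that returns every problem it finds. The response shape mirrors the handler above; `checkServerInfo` and its parameters are illustrative names, and the required-tool check is an addition on top of the version check the text describes.

```typescript
interface McpServerInfo {
  mcp_server_version: string;
  tools: string[];
}

// Returns a list of human-readable problems; an empty list means the run may proceed.
function checkServerInfo(
  info: McpServerInfo,
  expectedVersion: string,
  requiredTools: string[]
): string[] {
  const problems: string[] = [];
  if (info.mcp_server_version !== expectedVersion) {
    problems.push(`version mismatch: expected ${expectedVersion}, got ${info.mcp_server_version}`);
  }
  for (const tool of requiredTools) {
    if (info.tools.indexOf(tool) === -1) {
      problems.push(`missing tool: ${tool}`);
    }
  }
  return problems;
}
```

In a Code node, throwing when the returned list is non-empty stops the workflow with a descriptive error before any agent call is made.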

Nightly Unattended Loop via PM2 Cron

The same agent-plus-MCP pattern runs as a nightly static analysis sweep with zero human involvement. The PM2 ecosystem config schedules a Node script that calls git log to get all files changed since the last tagged deploy, sends them through the MCP-capable agent with the filesystem and shell tools available, instructs the agent to auto-apply ESLint fixes only on functions with cyclomatic complexity below a threshold (I use 10 — above that, it flags rather than touches), commits the result, and opens a draft PR via the GitHub API. If nothing changed or no fixable issues were found, it exits cleanly and PM2 logs the no-op.

```javascript
// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'nightly-lint-agent',
    script: './scripts/nightly-lint-loop.js',
    cron_restart: '0 2 * * *', // 2 AM local time
    autorestart: false,        // don't restart on exit: this is a one-shot job
    watch: false,
    env: {
      MCP_ENDPOINT: 'http://localhost:3100',
      GITHUB_TOKEN: process.env.GITHUB_TOKEN,
      COMPLEXITY_THRESHOLD: '10',
      BASE_BRANCH: 'main'
    }
  }]
};
```

One gotcha that doesn't show up until the second or third run: if the agent opens a draft PR but the nightly job runs again before anyone closes it, you'll accumulate stacked draft PRs against the same base. Guard against this by querying the GitHub API for open draft PRs authored by your bot token before the agent does any work — if one already exists for the same base branch, append to it rather than opening a new one, or skip the run entirely. The GitHub search API supports is:pr is:draft author:app/your-bot-name base:main which is fast enough to call synchronously at job start.
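The guard can be sketched as two small functions: one builds the search query, the other decides what the job does with the result. The `repo:` and `is:open` qualifiers are additions to the query quoted above (to scope it to one repository and ignore closed drafts); function names and the "reuse the oldest draft" policy are assumptions.

```typescript
// Build the issue-search query for open draft PRs from the bot against one base branch.
function buildDraftPrQuery(repoFullName: string, botAppName: string, baseBranch: string): string {
  return [
    `repo:${repoFullName}`,
    "is:pr",
    "is:open",
    "is:draft",
    `author:app/${botAppName}`,
    `base:${baseBranch}`,
  ].join(" ");
}

type NightlyPlan =
  | { action: "open_new_pr" }
  | { action: "append_to_existing"; prNumber: number };

// Decide what the nightly job should do given the PR numbers the search returned.
function planNightlyRun(openDraftPrNumbers: number[]): NightlyPlan {
  if (openDraftPrNumbers.length === 0) return { action: "open_new_pr" };
  // Reuse the oldest open draft (lowest number) so fixes accumulate in one place.
  return { action: "append_to_existing", prNumber: Math.min(...openDraftPrNumbers) };
}
```

Remember to URL-encode the query string before passing it as the `q` parameter of the search request.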

Three Non-Obvious Behaviors That Will Cost You Time

The behaviors that actually cost you hours aren't the ones in the error logs — they're the ones that produce wrong but plausible output and silent data corruption. Here are three that don't show up in the MCP spec until you've already been burned by them.

Tool Call Cancellation Is Not Guaranteed

If the model emits a partial tool call payload and the session drops mid-flight, several MCP clients will queue that call for retry on reconnect. The spec treats tool calls as idempotent by convention, but nothing in the protocol enforces that — and your file-write tool almost certainly isn't idempotent. The failure mode: reconnect triggers a second write_file call with the same arguments, and your patch gets applied twice. With line-insertion edits, that means duplicate blocks. With find-and-replace, the second pass may silently corrupt the file by matching on already-modified content.

The fix is a content guard on every write tool, not an assumption that the client handles this. Before applying any edit, read a hash of the current file state and compare it against a hash captured when the task was first dispatched:

```typescript
// write_tool_handler.ts
async function applyEdit(
  params: EditParams
): Promise<{ success: boolean; reason?: string; newHash?: string }> {
  const currentHash = await sha256OfFile(params.path);

  // caller passes the hash they saw when they read the file
  // if hashes diverge, the file changed under us, so refuse the write
  if (currentHash !== params.expectedHash) {
    return {
      success: false,
      reason: `content_mismatch: expected ${params.expectedHash}, got ${currentHash}`
    };
  }

  await applyPatch(params.path, params.patch);

  return {
    success: true,
    newHash: await sha256OfFile(params.path)
  };
}
```

A line-range guard works too and is simpler to reason about for bounded edits: the tool refuses to write if the target line range has shifted. Either way, the model gets a clean error it can reason about instead of silently writing corrupted output. The extra round-trip is worth it every time.
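A minimal sketch of the line-range variant, operating on file content as a string so it can be tested without touching disk. The names and the result shape are illustrative; the caller is assumed to pass the lines it saw in the target range when it read the file.

```typescript
interface LineRangeEdit {
  startLine: number;       // 1-indexed, inclusive
  endLine: number;         // 1-indexed, inclusive
  expectedLines: string[]; // what the caller saw in that range when it read the file
  replacement: string[];
}

type LineRangeResult =
  | { success: true; content: string }
  | { success: false; reason: string };

function applyLineRangeEdit(content: string, edit: LineRangeEdit): LineRangeResult {
  const lines = content.split("\n");
  if (edit.startLine < 1 || edit.endLine < edit.startLine || edit.endLine > lines.length) {
    return { success: false, reason: "range_out_of_bounds" };
  }
  const current = lines.slice(edit.startLine - 1, edit.endLine);
  const unchanged =
    current.length === edit.expectedLines.length &&
    current.every((line, i) => line === edit.expectedLines[i]);
  // If the range no longer holds what the caller saw, refuse instead of guessing.
  if (!unchanged) return { success: false, reason: "range_shifted" };
  const next = [
    ...lines.slice(0, edit.startLine - 1),
    ...edit.replacement,
    ...lines.slice(edit.endLine),
  ];
  return { success: true, content: next.join("\n") };
}
```

A replayed edit fails cleanly here: after the first application the range holds the replacement text, so the second attempt returns `range_shifted` instead of duplicating the block.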

Context Window Bleed Kills Long Agent Sessions

Code mode accumulates tool call results directly in the context window — every read_file response, every lint output, every test result goes in as a message. On a 32B model running at 32K context, around 15 tool calls in, the model starts behaviorally ignoring early system instructions. It doesn't error out. It just stops following the task constraints you set at the top of the session — the coding style rules, the file exclusion list, the output format requirements. This is a soft attention failure, not a hard truncation, and it's insidious because the output still looks reasonable.

The practical mitigation is task checkpointing: break any job that requires more than 8-10 tool calls into explicit subtasks, each with its own fresh context. In my n8n flows, I do this by treating each subtask as a separate HTTP call to the model endpoint, passing only a compressed summary of prior state rather than the full tool call history. Something like a 200-token "checkpoint header" that says files modified so far, constraints still active, next objective — assembled by a lightweight summarizer node before each subtask starts:

```http
# n8n HTTP Request node: subtask dispatch
POST /api/generate
{
  "model": "qwen2.5-coder:32b",
  "system": "{{ $json.checkpointHeader }}",
  "prompt": "{{ $json.subtaskPrompt }}",
  "stream": false,
  "options": { "num_ctx": 32768 }
}
```

The checkpoint header is cheap to generate and completely sidesteps the bleed problem. The tradeoff is that you lose conversational continuity — the model can't reference specific earlier outputs by memory. For code generation tasks that's almost never a problem; for exploratory debugging sessions where context coherence matters, you may prefer a smaller model with a larger context window instead.
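A checkpoint header builder can be sketched as a pure function. The field names are illustrative, and the 4-characters-per-token figure is a rough budgeting assumption, not a tokenizer. The objective and constraints go first so that if the header has to be truncated, the file list is what gets cut.

```typescript
interface CheckpointState {
  filesModified: string[];
  activeConstraints: string[];
  nextObjective: string;
}

// Rough heuristic (assumption): about 4 characters per token.
const APPROX_CHARS_PER_TOKEN = 4;

function buildCheckpointHeader(state: CheckpointState, maxTokens = 200): string {
  const header = [
    `Next objective: ${state.nextObjective}`,
    `Constraints still active: ${state.activeConstraints.join("; ") || "none"}`,
    `Files modified so far: ${state.filesModified.join(", ") || "none"}`,
  ].join("\n");
  const maxChars = maxTokens * APPROX_CHARS_PER_TOKEN;
  // Truncate from the end so the least important field (the file list) is dropped first.
  return header.length <= maxChars ? header : header.slice(0, maxChars);
}
```

If the constraint list itself grows past the budget, that is usually a sign the subtask split is too coarse rather than a reason to raise the limit.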

Resources vs Tools: The Cold-Start Latency You're Leaving on the Table

MCP draws a hard architectural line between resources and tools. Resources are fetched and cached by the client at session initialization — they're available to the model with no round-trip cost. Tools are called on demand, which means a network hop plus server-side execution every single invocation. Most MCP server implementations default to exposing everything as a tool because it's simpler to implement, and most tutorials do the same. The cost is paid at cold start: if your agent calls describe_schema or list_available_endpoints at the top of every session, you're paying 2-4 seconds per session just to hand the model static information it could have had for free.

Anything that doesn't change between sessions belongs in a resource. Schema definitions, API surface documentation, file tree snapshots of stable directories, environment capability lists — these are all resource candidates. Exposing them correctly in your MCP server definition looks like this:

```yaml
# mcp_server_config.yaml
resources:
  - uri: "schema://db/main"
    name: "Main database schema"
    mimeType: "application/json"
    # served from a static file or a cached query result refreshed on server start
    handler: "handlers.schema.serve_main"
  - uri: "schema://api/openapi"
    name: "Internal API surface"
    mimeType: "application/yaml"
    handler: "handlers.schema.serve_openapi"

tools:
  - name: "run_query"
    description: "Execute a read-only SQL query"
    # this is dynamic, so keep it as a tool
    handler: "handlers.db.run_query"
```

A client that preloads resources reads the declared ones right after the handshake and includes them in the initial context before the first user message, so the model has the schema available immediately without any tool call overhead. MCP leaves the decision of when to load resources to the host application, so confirm that your client actually does this. The common gotcha here: if your resource handler hits a slow database or a remote API at serve time, you've moved the latency problem rather than eliminated it. Cache the resource payload on server startup and refresh it on a schedule, not on every client connection.
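The "cache on startup, refresh on a schedule" advice can be sketched as a small wrapper around the resource loader. `CachedResource` is an illustrative name, and the loader is synchronous here only to keep the sketch short and testable; a real handler would most likely be async.

```typescript
class CachedResource<T> {
  private payload: T;
  private loadedAtMs: number;
  private readonly load: () => T;
  private readonly now: () => number;

  constructor(load: () => T, now: () => number = () => Date.now()) {
    this.load = load;
    this.now = now;
    this.payload = load(); // load once at server startup
    this.loadedAtMs = now();
  }

  // Called on every client connection: never touches the slow backend.
  serve(): T {
    return this.payload;
  }

  // Called by a timer, not by client traffic.
  refresh(): void {
    this.payload = this.load();
    this.loadedAtMs = this.now();
  }

  // Age of the cached payload, useful for logging or a staleness alert.
  ageMs(): number {
    return this.now() - this.loadedAtMs;
  }
}
```

Wiring `refresh()` to something like `setInterval` keeps the slow database query off the connection path entirely, at the cost of serving a payload that may be up to one interval old.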

When This Setup Is Not the Right Call

The filesystem MCP tool's recursive read is genuinely useful on a focused project directory — it stops being useful the moment your codebase crosses into monorepo territory. Past roughly 500K LOC, the tool will either get truncated by the model's context window before it assembles a coherent picture, or it'll burn most of the context budget on directory traversal before a single line of analysis happens. If you're in that situation, the right architecture is a code embedding index with semantic search in front of it. I use bge-m3 for this — it's dense enough to handle code tokens well and fits comfortably in a moderate VRAM budget. The MCP layer then becomes a thin retrieval interface rather than a raw filesystem reader: the agent queries the index, gets back the 10-15 most relevant chunks, and works from those instead of trying to read the repo whole.
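The retrieval step behind that thin interface is a nearest-neighbour search over precomputed embeddings. The sketch below assumes chunk vectors already exist (for example, produced by bge-m3 in a separate indexing job) and uses a brute-force cosine scan; the names are illustrative, and a real index at monorepo scale would use an approximate-nearest-neighbour store instead.

```typescript
interface EmbeddedChunk {
  path: string;
  text: string;
  vector: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding, best first.
function topKChunks(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

The MCP tool then exposes something like a single search operation whose handler embeds the query text and calls `topKChunks`, rather than exposing directory reads.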

GPU constraint is the other hard wall. The agentic loop that makes Code Mode + MCP actually useful requires a model capable of multi-step tool use and self-correction — on my 32GB VRAM box that's workable with a quantized 32B or a pair of smaller specialized models. Under 16GB VRAM, you don't have that headroom. A 7B or 8B model will accept MCP tool calls, but the planning quality degrades fast on anything beyond two-step tasks; you'll spend more time fixing the agent's mistakes than you would have spent writing the code. The practical answer is to route the agent to a cloud API — Claude 3.5 Sonnet or GPT-4o handle the reasoning — and keep MCP tool permissions strictly read-only. Read-only means the worst case is a wasted API call, not an overwritten file or an executed shell command you didn't intend.
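One way to enforce the read-only rule is to filter the tool list before it ever reaches the cloud model, using an allowlist so that newly added tools are excluded by default. This is a sketch: the tool shape matches the request body shown earlier, and the allowlist contents are an assumption you would adapt to your own server.

```typescript
interface FunctionTool {
  type: "function";
  function: { name: string; description?: string };
}

// Allowlist (assumption): only tools known to have no side effects.
const READ_ONLY_TOOL_NAMES: ReadonlyArray<string> = ["mcp__filesystem__read_file"];

// Drop every tool that is not explicitly allowlisted before building the request.
function restrictToReadOnly(tools: FunctionTool[]): FunctionTool[] {
  return tools.filter((tool) => READ_ONLY_TOOL_NAMES.indexOf(tool.function.name) !== -1);
}
```

Filtering on the client side is a convenience, not a security boundary; the MCP server should also refuse write and exec calls when it runs in this mode.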

Shared dev environments are where this setup can become a genuine security problem rather than just an inconvenience. An MCP server with shell exec permissions is an arbitrary code execution surface. On a single-operator workstation where you control the process list, what the server can spawn, and what credentials are in the environment, that's a manageable risk — you're the only lateral movement target and you can audit it. On shared infra — a team dev box, a cloud VM that multiple engineers SSH into, a Kubernetes pod with a mounted service account — an MCP server with exec access is a privilege escalation path waiting to happen. One misconfigured tool definition or one prompt injection through a malicious code comment, and the blast radius extends to every other user and service that machine can reach. This architecture was designed for single-operator use; treating it as a team tool without a significant rethink of the permission model is the wrong call.

Monorepo scale (>500K LOC): Replace recursive filesystem reads with bge-m3 embeddings + semantic retrieval; the MCP tool becomes a query interface, not a directory walker.
Under 16GB VRAM: Use a cloud API for the agent brain, lock MCP tools to read-only, accept that local inference isn't the bottleneck worth optimizing here.
Shared infrastructure: MCP shell access on multi-user systems is a lateral movement risk; the single-operator assumption baked into this setup doesn't hold and the permission model needs a full redesign before it's safe to deploy that way.

Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.

Bind9 vs PowerDNS at Home: Which One to Run When You Actually Need Local DNS

우병수 — Fri, 26 Jun 2026 08:11:57 +0000

TL;DR: The silent failure mode is the one that gets you. Your router's built-in resolver (whatever dnsmasq stub is baked into your OpenWrt or consumer firmware) handles nas.local and homeassistant.local just fine with two devices.

📖 Reading time: ~20 min

What's in this article

The Problem That Makes Local DNS Worth Running
What Each Tool Actually Is (And Isn't)
Setting Up Bind9: Authoritative Server for Internal Zones
Setting Up PowerDNS: When You Want an API and a Database Backend
Bind9 vs PowerDNS: The Honest Operator Comparison
Running Either One in Docker Without Losing Your Mind
What to Point Your Clients At and How to Test It

The Problem That Makes Local DNS Worth Running

The silent failure mode is the one that gets you. Your router's built-in resolver — whatever dnsmasq stub is baked into your OpenWrt or consumer firmware — handles nas.local and homeassistant.local just fine with two devices. Add Nginx Proxy Manager routing to five services behind a single IP, throw in a Traefik instance for containers, and you'll start hitting a specific class of breakage: the hostname resolves, but to the wrong thing, or resolves on some devices and not others, with zero error messages to tell you why. The router just silently returns whatever it last cached or makes a best-guess mDNS query that works on macOS and fails on Linux.

The /etc/hosts approach feels like a fix but compounds the problem. You add 192.168.1.50 grafana.home.lab to your laptop, forget to add it to your phone, your n8n container can't resolve it at all because it's running in Docker with its own DNS context, and now you're SSHing into three different machines to update flat files every time a service moves IPs. Reverse proxies specifically depend on stable FQDNs because that's how virtual hosting works — Nginx Proxy Manager matches Host: headers to upstream definitions, and if half your clients are hitting an IP directly or resolving a stale hosts entry, the proxy rule never fires. Traefik has the same dependency; its routing rules are built around hostnames, not IPs.

What a real local resolver actually gives you breaks down into three concrete things. First, a single authoritative source for your internal zone — change an IP in one place and every device on the network picks it up on next TTL expiry. Second, wildcard records: a single *.home.lab A 192.168.1.50 entry means every subdomain you spin up under that zone resolves immediately without touching DNS config again. Third, query latency that doesn't leave your LAN — a bind9 or PowerDNS instance answering from its local zone file returns under 1ms, versus 20-80ms for an upstream query to your ISP or even a local Pi-hole forwarding upstream. For webhook delivery in automation pipelines, that latency difference is irrelevant, but the reliability difference isn't: a misconfigured or unavailable upstream resolver kills webhook callbacks silently.

Split-horizon DNS is where this gets genuinely useful beyond just convenience. The idea is that grafana.home.lab resolves to 192.168.1.50 from inside your network, and either doesn't resolve at all or resolves to a public IP from outside — depending on whether you're also publishing that subdomain publicly. Your router's resolver cannot do this. It has one answer for a name, and that answer is whatever it learned from upstream or from its own static config. A proper authoritative DNS server for your internal zone handles split-horizon natively because it's authoritative for a zone that simply doesn't exist in the public DNS tree. If your home lab is growing toward full automation pipelines, see the guide on Workflow Automation in 2026: n8n, Zapier, and Self-Hosted Pipelines — webhook delivery specifically requires that your internal services resolve consistently from wherever the webhook sender lives, which is a DNS problem before it's anything else.

What Each Tool Actually Is (And Isn't)

The single distinction that trips up almost every beginner: authoritative and recursive DNS are completely different jobs. An authoritative server answers "I own this zone, here's the record." A recursive resolver answers "let me go find that for you." Bind9 can do either or both inside one named process, while PowerDNS splits the two roles across separate binaries. Conflating the roles is why home lab setups break in confusing ways: your server responds to dig queries about your own domain but refuses everything else.

Bind9

Bind9 is the reference implementation of the DNS protocol: when the RFCs describe how DNS is supposed to behave, Bind9 is usually what they're testing against. It runs as a single daemon (named), reads zone files in the RFC-standard text format (one record per line, SOA at the top, trailing dots required), and can serve those zones authoritatively, act as a recursive resolver, or both, depending on how named.conf is written. Version 9.18.x, which ships in Ubuntu 22.04's standard repos, added significantly better structured logging compared to older 9.16.x builds: you get category-level log channels out of the box instead of a wall of undifferentiated output. The zone file format is unambiguous and version-controllable with plain git, which is its real advantage for home use. The downside: editing zone files by hand means manually incrementing the SOA serial number every time, and forgetting that will cause secondaries to silently ignore your changes.
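The serial bump is mechanical enough to script. The sketch below assumes the common YYYYMMDDnn convention (the same shape as a serial like 2024010101) and UTC dates; `nextSoaSerial` is an illustrative name.

```typescript
// Compute the next SOA serial under the YYYYMMDDnn convention.
// SOA serials are unsigned 32-bit values, which this format fits comfortably.
function nextSoaSerial(currentSerial: number, today: Date): number {
  const year = today.getUTCFullYear();
  const month = today.getUTCMonth() + 1;
  const day = today.getUTCDate();
  const todayBase = (year * 10000 + month * 100 + day) * 100; // e.g. 2024061500

  // Already edited today (or the serial is somehow ahead): just increment,
  // because the serial must never go backwards.
  if (currentSerial >= todayBase) return currentSerial + 1;

  // First edit of the day: jump to today's first revision.
  return todayBase + 1;
}
```

Running this from a pre-commit hook on the zone repository removes the most common reason secondaries ignore an edit.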

PowerDNS

PowerDNS ships as two completely separate binaries that do not share configuration files or processes: pdns_server (the authoritative server) and pdns_recursor. You can run one without the other. The authoritative server is usually run against a SQL database backend: SQLite for single-node setups, PostgreSQL or MySQL for anything you care about. That database dependency is real overhead, but it means you can update DNS records with a single SQL statement or a REST API call instead of editing a text file, incrementing a serial, and reloading a daemon. For automation (feeding records from a provisioning script or n8n flow) this is a significant practical advantage. The Recursor is a separate project on a separate release cadence; as of this guide, PowerDNS Authoritative is at 4.8.x and pdns_recursor is at 5.x. Do not assume the version numbers track each other.

What Neither Tool Is

These are the things beginners reach for Bind9 or PowerDNS to do, and shouldn't:

DHCP server: that's dnsmasq or Kea. Both Bind9 and PowerDNS will answer DNS queries; neither assigns IP addresses. If you want DNS records to update automatically when a device gets a DHCP lease, you need a separate integration layer.
DNS-over-HTTPS or DNS-over-TLS termination: that's usually dnsdist (which is actually a PowerDNS project) or a resolver like Unbound with TLS enabled. pdns_server does not speak DoH or DoT, and although named gained DoT and DoH listeners in 9.18, they need explicit tls and http configuration that this guide doesn't cover.
Firewall-level ad blocking: that's Pi-hole, which is a web UI layered on top of dnsmasq (or FTL, their fork of it). Pi-hole blocks ads by returning NXDOMAIN or a null IP for blocklisted domains at the resolver level. Bind9 and PowerDNS can technically be configured to do something similar with RPZ (Response Policy Zones), but that's a significantly more complex setup and not what this guide covers.

Keeping these boundaries clear matters before you pick a tool. If your goal is a local authoritative server for home.lab with predictable, version-controlled zone data, Bind9 is the lower-friction path. If your goal is automation — records managed by scripts, a REST API, or a database you already operate — the PowerDNS authoritative server's SQL backend earns its extra dependency.

Setting Up Bind9: Authoritative Server for Internal Zones

The part that surprises most people setting up Bind9 for the first time: the default package configuration on Debian/Ubuntu ships with recursion enabled. Bind9's built-in default limits recursion to localhost and directly connected networks, but on a home LAN that is every client you have. A firewall helps, but the correct fix isn't to rely on firewall rules; it's to configure Bind9 to not do recursion at all on an authoritative-only server. Open recursive resolvers get abused for DNS amplification attacks. Lock it down at the application layer first.

```shell
# Install bind9 and the companion utilities (dig, named-checkconf, etc.)
apt install bind9 bind9utils bind9-doc

# /etc/bind/named.conf.options
options {
    directory "/var/cache/bind";

    # Only listen on your LAN interface and loopback, not 0.0.0.0
    listen-on { 192.168.1.0/24; 127.0.0.1; };

    # This server is authoritative-only. No recursion, no forwarding.
    allow-recursion { none; };
    recursion no;

    # Don't answer queries from outside your LAN
    allow-query { 192.168.1.0/24; 127.0.0.1; };

    dnssec-validation auto;
};
```

After locking down the options, wire up your internal zone in /etc/bind/named.conf.local. The zone declaration just tells Bind9 where to find the zone file — the actual records live separately.

```conf
# /etc/bind/named.conf.local
zone "home.lab" {
    type master;
    file "/etc/bind/db.home.lab";
};

# Reverse zone for 192.168.1.x: the name format is counterintuitive
zone "1.168.192.in-addr.arpa" {
    type master;
    file "/etc/bind/db.192.168.1";
};
```

Now the zone file itself. The trailing dot on FQDNs is the single most common silent failure in Bind9 setups. If you write ns1.home.lab without a trailing dot inside a zone file, Bind9 appends the zone origin, giving you ns1.home.lab.home.lab. named-checkzone won't always flag this as an error: it's syntactically valid, just wrong. You'll see successful zone loads and completely broken resolution. Add the dot, every time, on any name that's fully qualified.

```plaintext
; /etc/bind/db.home.lab
$TTL 3600
@   IN  SOA ns1.home.lab. admin.home.lab. (
        2024010101  ; Serial: increment this on every edit
        3600        ; Refresh
        900         ; Retry
        604800      ; Expire
        300 )       ; Negative TTL

; NS record: trailing dot required on the FQDN
@       IN  NS  ns1.home.lab.

; A records for your infrastructure
ns1     IN  A   192.168.1.10
router  IN  A   192.168.1.1
nas     IN  A   192.168.1.20
n8n     IN  A   192.168.1.30
ollama  IN  A   192.168.1.32
```
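The trailing-dot rule is easier to internalize as code. This sketch mimics how a zone-file name is expanded against the zone origin; `expandZoneName` is an illustrative helper, not part of any Bind9 tooling.

```typescript
// Expand a name as written in a zone file into the FQDN the server will actually serve.
function expandZoneName(name: string, zoneOrigin: string): string {
  const originFqdn = zoneOrigin.slice(-1) === "." ? zoneOrigin : zoneOrigin + ".";
  if (name === "@") return originFqdn;      // "@" means the origin itself
  if (name.slice(-1) === ".") return name;  // trailing dot: already fully qualified
  return `${name}.${originFqdn}`;           // relative name: origin gets appended
}
```

Feeding it `ns1.home.lab` (no trailing dot) with origin `home.lab.` yields `ns1.home.lab.home.lab.`, which is exactly the broken record described above, while the short form `ns1` and the dotted form `ns1.home.lab.` both yield the intended name.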

```plaintext
; /etc/bind/db.192.168.1
$TTL 3600
@   IN  SOA ns1.home.lab. admin.home.lab. (
        2024010101
        3600
        900
        604800
        300 )

@   IN  NS  ns1.home.lab.

; PTR records: left side is just the last octet
10  IN  PTR ns1.home.lab.
1   IN  PTR router.home.lab.
20  IN  PTR nas.home.lab.
30  IN  PTR n8n.home.lab.
32  IN  PTR ollama.home.lab.
```

PTR records aren't optional decoration. sshd performs a reverse lookup on connecting clients when UseDNS is enabled, and several Docker healthcheck scripts do reverse lookups to verify they're talking to the right host. Missing PTR records cause subtle, hard-to-diagnose failures, such as slow SSH logins, that show up weeks after initial setup, especially if you're running anything that validates hostnames against IPs.
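The reversed naming used by the reverse zone and its PTR owners can be derived mechanically. A small sketch for IPv4 with a /24 zone, matching the layout above; the function names are illustrative.

```typescript
// Parse and validate a dotted-quad IPv4 address into its four octets.
function parseIpv4(ip: string): number[] {
  const parts = ip.split(".");
  if (parts.length !== 4) throw new Error(`not an IPv4 address: ${ip}`);
  return parts.map((part) => {
    const value = Number(part);
    if (!/^\d{1,3}$/.test(part) || value > 255) {
      throw new Error(`invalid octet "${part}" in ${ip}`);
    }
    return value;
  });
}

// Full owner name of the PTR record for an address (what dig -x looks up).
function ptrOwnerName(ip: string): string {
  return parseIpv4(ip).reverse().join(".") + ".in-addr.arpa.";
}

// Name of the /24 reverse zone that contains the address.
function reverseZoneForSlash24(ip: string): string {
  const [a, b, c] = parseIpv4(ip);
  return `${c}.${b}.${a}.in-addr.arpa`;
}
```

For 192.168.1.32 this gives the zone `1.168.192.in-addr.arpa` and the owner `32.1.168.192.in-addr.arpa.`, which is why only the last octet appears on the left side of each PTR line.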

Test everything before touching your router's DNS setting. The distinction between SERVFAIL and NXDOMAIN matters here: NXDOMAIN means "zone loaded fine, record doesn't exist"; SERVFAIL means "something is broken in the zone itself." If you're getting SERVFAIL on a name you just added, the zone file has a syntax error and Bind9 is serving stale data or nothing at all.

```shell
# Validate config syntax first (exits non-zero on any error)
named-checkconf

# Validate the zone file specifically
named-checkzone home.lab /etc/bind/db.home.lab
# Expected output: zone home.lab/IN: loaded serial 2024010101
# OK

named-checkzone 1.168.192.in-addr.arpa /etc/bind/db.192.168.1

# Reload after a clean check: don't restart, reload preserves state
systemctl reload bind9

# Forward lookup: expect NOERROR + an A record in the ANSWER section
dig @127.0.0.1 ollama.home.lab A

# Reverse lookup: expect NOERROR + PTR record
dig @127.0.0.1 -x 192.168.1.32

# If you get SERVFAIL here, check: journalctl -u named | tail -20
# Look for "zone home.lab/IN: loading from master file failed"
```

Setting Up PowerDNS: When You Want an API and a Database Backend

The thing that surprises most people about PowerDNS is that it ships as two completely separate binaries — pdns_server (the authoritative server) and pdns_recursor — and you almost always need both running together for a home lab setup to actually work. Get one wrong and clients either can't resolve internal names or get REFUSED for anything outside your domain. That split architecture is worth understanding before you touch a config file.

Start with just the authoritative server and the SQLite3 backend. For a home lab with a handful of zones, there's no reason to spin up PostgreSQL:

```shell
apt install pdns-server pdns-backend-sqlite3

# Initialize the schema: PowerDNS ships the SQL file at this path
sqlite3 /var/lib/powerdns/pdns.sqlite3 < /usr/share/doc/pdns-backend-sqlite3/schema.sqlite3.sql
```

Then edit /etc/powerdns/pdns.conf — the default file is heavily commented, which helps. The minimum working config for an API-enabled authoritative server looks like this:

```conf
# /etc/powerdns/pdns.conf
launch=gsqlite3
gsqlite3-database=/var/lib/powerdns/pdns.sqlite3

# Run on a non-standard port so pdns-recursor can sit in front on :53
local-port=5300
local-address=127.0.0.1

# Enable the REST API
api=yes
api-key=yoursecretkey
webserver=yes
webserver-address=127.0.0.1
webserver-port=8081
webserver-allow-from=127.0.0.1
```

Once pdns_server is running, you can create a zone without touching any flat files. This is the actual architectural difference from Bind9 — everything goes through the API or the database, not a text file you reload:

```shell
# Create a zone
curl -s -X POST http://localhost:8081/api/v1/servers/localhost/zones \
  -H "X-API-Key: yoursecretkey" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "home.arpa.",
    "kind": "Native",
    "nameservers": ["ns1.home.arpa."]
  }'

# Add an A record
curl -s -X PATCH http://localhost:8081/api/v1/servers/localhost/zones/home.arpa. \
  -H "X-API-Key: yoursecretkey" \
  -H "Content-Type: application/json" \
  -d '{
    "rrsets": [{
      "name": "router.home.arpa.",
      "type": "A",
      "ttl": 300,
      "changetype": "REPLACE",
      "records": [{"content": "192.168.1.1", "disabled": false}]
    }]
  }'
```

Now install pdns-recursor as the second process. This one listens on port 53 and handles all upstream recursion — Google, Cloudflare, your ISP — while forwarding your internal zones back to the authoritative server running on 127.0.0.1:5300. Edit /etc/powerdns/recursor.conf:

```conf
# /etc/powerdns/recursor.conf
local-address=0.0.0.0
local-port=53

# Forward internal zone to the authoritative server
forward-zones=home.arpa.=127.0.0.1:5300

# Use upstream resolvers for everything else
forward-zones-recurse=.=1.1.1.1;8.8.8.8
```

The gotcha that burns everyone at least once: the PowerDNS authoritative server will flat-out refuse recursive queries. Point a client directly at port 5300 and ask for google.com — you get REFUSED, not a timeout, not a forwarded answer. REFUSED. This is correct behavior per DNS spec, but it looks like a misconfiguration the first time you see it. The fix is always the same: your clients and your DHCP server should point at the recursor on port 53, never directly at the authoritative server. The authoritative server is an internal implementation detail. If you're also running dnsmasq for DHCP, you can point its upstream at the recursor instead of removing dnsmasq entirely — they compose fine as long as port 53 isn't double-bound.
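The recursor's forwarding decision amounts to a longest-suffix match on the query name. The sketch below is a simplified model of that decision for reasoning about your config, not PowerDNS code; the types and function name are illustrative.

```typescript
interface ForwardZone {
  zone: string;   // zone name with trailing dot, e.g. "home.arpa."
  target: string; // where queries for that zone go, e.g. "127.0.0.1:5300"
}

// Decide which servers should receive a query for qname.
function pickForwarder(qname: string, zones: ForwardZone[], upstreams: string[]): string[] {
  const lowered = qname.toLowerCase();
  const fqdn = lowered.slice(-1) === "." ? lowered : lowered + ".";
  let best: ForwardZone | undefined;
  for (const candidate of zones) {
    const zone = candidate.zone.toLowerCase();
    // Match the zone apex itself or any name below it (label boundary enforced by the dot).
    const matches = fqdn === zone || fqdn.slice(-(zone.length + 1)) === "." + zone;
    if (matches && (!best || zone.length > best.zone.length)) best = candidate;
  }
  // Internal zones go to the authoritative server; everything else goes upstream.
  return best ? [best.target] : upstreams;
}
```

With the config above, `router.home.arpa` goes to `127.0.0.1:5300` and `google.com` goes to the upstream resolvers, which is why clients only ever need to know about port 53.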

Bind9 vs PowerDNS: The Honest Operator Comparison

The most useful thing I can tell you upfront: these two servers solve the same core problem with completely different philosophies, and the right choice becomes obvious once you know which direction your setup is heading.

Getting Bind9 to a first working zone takes about 20 minutes. You edit named.conf.local to declare the zone, write the zone file itself, run named-checkzone to verify syntax, and reload:

```conf
# /etc/bind/named.conf.local
zone "home.lab" {
    type master;
    file "/etc/bind/zones/db.home.lab";
};
```

```shell
# verify before reloading: catches silent syntax errors
named-checkzone home.lab /etc/bind/zones/db.home.lab

# reload without restarting (preserves cache)
rndc reload
```

Two files, one command. PowerDNS's equivalent starting point involves picking a backend (SQLite is the right answer for a single-operator setup), initializing the schema, configuring pdns.conf with the backend path and API key, then understanding that PowerDNS splits authoritative serving and recursive resolution into two separate binaries — pdns_server and pdns_recursor — that need to be configured to talk to each other on different ports. None of this is hard, but it's not 20 minutes either. If you want stable internal DNS and don't need automation, Bind9 is the pragmatic choice.

The calculus flips the moment you want to programmatically register records. Every time a new Docker container comes up, or an n8n webhook endpoint gets created, you do not want to be sed-ing a zone file and calling rndc reload. That path breaks on concurrent writes, requires root or sudo for file edits, and the error surface is wide. A PowerDNS API call is atomic and purpose-built:

```shell
# add an A record via PowerDNS REST API: no file manipulation, no reload
curl -s -X PATCH http://localhost:8081/api/v1/servers/localhost/zones/home.lab. \
  -H "X-API-Key: your-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{
    "rrsets": [{
      "name": "new-container.home.lab.",
      "type": "A",
      "ttl": 60,
      "changetype": "REPLACE",
      "records": [{ "content": "192.168.1.45", "disabled": false }]
    }]
  }'
```

That call returns immediately, the record is live, and you can call it from an n8n HTTP Request node or a Node.js script without touching the filesystem. Bind9 does have nsupdate for dynamic updates via the DNS protocol, but it requires TSIG key setup and the tooling is significantly more awkward than a JSON PATCH request.
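From a Node.js script, the request body is worth building with a helper so the trailing dot is never forgotten (the API expects canonical names). This is a sketch of the payload builder only; `buildARecordPatch` is an illustrative name, and sending it is a matter of passing `JSON.stringify(...)` as the body of a PATCH request with the `X-API-Key` header.

```typescript
interface RrsetPatch {
  rrsets: Array<{
    name: string;
    type: string;
    ttl: number;
    changetype: "REPLACE" | "DELETE";
    records: Array<{ content: string; disabled: boolean }>;
  }>;
}

// Build the PATCH body that replaces the A record set for one name.
function buildARecordPatch(fqdn: string, ipv4: string, ttl = 60): RrsetPatch {
  // The API expects canonical names, so add the trailing dot if it is missing.
  const canonicalName = fqdn.slice(-1) === "." ? fqdn : fqdn + ".";
  return {
    rrsets: [
      {
        name: canonicalName,
        type: "A",
        ttl,
        changetype: "REPLACE",
        records: [{ content: ipv4, disabled: false }],
      },
    ],
  };
}
```

Because REPLACE swaps the whole record set for that name and type, calling it twice with the same arguments leaves the zone in the same state, which makes it safe to retry from a container start hook.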

On resource footprint: Bind9 with a single zone idles around 20–30 MB RSS. PowerDNS authoritative plus recursor plus SQLite backend runs closer to 60–80 MB total across both processes. On any machine already running Docker, this is completely irrelevant — Docker's own overhead dwarfs it. Where it actually matters is a constrained device like a Pi Zero 2W, where you're budgeting memory carefully and a single lean Bind9 process is the cleaner fit.

| Factor | Bind9 | PowerDNS |
| --- | --- | --- |
| Setup to first working zone | ~20 min | ~40 min |
| Backend | Flat zone files | SQLite / PostgreSQL / MySQL |
| REST API | No (rndc + nsupdate only) | Yes, built-in |
| Recursion | Yes, same process with config | Separate binary (pdns_recursor) |
| Idle memory (single zone) | ~20–30 MB | ~60–80 MB (both binaries) |
| Best for | Static internal zones, fast setup | Dynamic records, scripted automation |

Running Either One in Docker Without Losing Your Mind

The single biggest reason DNS containers fail to start has nothing to do with your config files. On Ubuntu 22.04 and Debian 12, systemd-resolved runs a stub listener on 127.0.0.53:53 before your container even tries to bind. You'll see bind: address already in use and spend an hour re-reading your named.conf looking for a typo that isn't there. Fix this first, before you pull any image:

```shell
# Stop and permanently disable the stub resolver
sudo systemctl disable --now systemd-resolved

# Remove the symlink resolv.conf and write a real one
sudo rm /etc/resolv.conf
echo "nameserver 192.168.1.10" | sudo tee /etc/resolv.conf

# Replace 192.168.1.10 with the LAN IP of your Docker host:
# that's where your new DNS container will answer queries
```

After that, verify nothing is still holding port 53 with sudo ss -tulpn | grep ':53'. The output should be empty. Only then does it make sense to bring up a DNS container — otherwise you're just restarting a container that will immediately fail with the same error and a different service name in the log.

For Bind9, the internetsystemsconsortium/bind9:9.18 image is the one to use: it's maintained by ISC, ships with a predictable directory layout, and doesn't include distro patches that quietly change default behavior. The critical networking decision: do not put a DNS server behind Docker's bridge networking with NAT. ACLs based on source IP break when Docker's NAT layer rewrites the client address. Use network_mode: host, or map ports explicitly and accept that allow-query will see your host IP, not the actual client. Here's a working compose file:

```yaml
version: "3.9"
services:
  bind9:
    image: internetsystemsconsortium/bind9:9.18
    container_name: bind9
    network_mode: host  # preserves real client IPs for ACL matching
    restart: unless-stopped
    volumes:
      - ./config/named.conf:/etc/bind/named.conf:ro
      - ./config/zones:/etc/bind/zones:ro
      - bind9-cache:/var/cache/bind
    # No ports: block needed; host mode handles it

volumes:
  bind9-cache:
```

Your named.conf and zone files live in ./config/ relative to the compose file, mounted read-only. The cache volume is a named Docker volume so query caches survive container restarts without cluttering your host filesystem. One gotcha: Bind9 inside the container runs as the bind user (UID 101 on that image), so if you get permission errors on your zone files, chmod 644 on the files and confirm the mount path matches exactly — a trailing slash difference will cause a silent empty mount.

PowerDNS in Docker has a separate problem: the SQLite backend needs a database file that actually persists, and the default socket directory inside the container is /var/run/pdns, not /run/pdns like the Debian package assumes. If you mount a pdns.conf written for a bare-metal install without adjusting those paths, the control socket won't open and pdns_control commands will fail silently. Mount a named volume for the database and a custom config that corrects both paths:

```yaml
version: "3.9"
services:
  powerdns:
    image: powerdns/pdns-auth-48:4.8.3
    container_name: powerdns
    network_mode: host
    restart: unless-stopped
    volumes:
      - ./config/pdns.conf:/etc/powerdns/pdns.conf:ro
      - pdns-data:/var/lib/powerdns  # SQLite db lives here
    environment:
      - PDNS_AUTH_API_KEY=changeme

volumes:
  pdns-data:
```

```conf
# pdns.conf: key paths that differ from Debian package defaults
launch=gsqlite3
gsqlite3-database=/var/lib/powerdns/pdns.sqlite3
socket-dir=/var/run/pdns # must match what's inside the container image
local-address=0.0.0.0
local-port=53
api=yes
api-key=changeme
webserver=yes
webserver-address=0.0.0.0
webserver-port=8081
```

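Initialize the schema before first start if the volume doesn't already contain a database. Note that pdnsutil create-bind-db only creates the DNSSEC database used by the bind backend; the gsqlite3 backend needs the tables from the schema.sqlite3.sql file that ships with PowerDNS:

```shell
# Run once against the named volume to load the gsqlite3 schema.
# Assumptions: the image includes the sqlite3 CLI and keeps the schema at the
# path below; confirm both for your image. Compose may also prefix the volume
# name with the project name (check `docker volume ls`).
docker run --rm \
  --entrypoint sh \
  -v pdns-data:/var/lib/powerdns \
  powerdns/pdns-auth-48:4.8.3 \
  -c 'sqlite3 /var/lib/powerdns/pdns.sqlite3 < /usr/local/share/doc/pdns/schema.sqlite3.sql'
```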

After that, docker compose up -d should bring PowerDNS up cleanly. Test with dig @127.0.0.1 your.domain A from the host — if you get a REFUSED instead of NXDOMAIN, your allow-query or local-address setting isn't matching the interface the query arrived on, which loops back to the network mode decision above.

What to Point Your Clients At and How to Test It

The most common mistake after getting Bind9 or PowerDNS responding correctly on the server is assuming all your clients will just pick it up. They won't — not until you tell your router to advertise the new DNS server via DHCP. Go into your router's DHCP server config (usually under LAN settings) and replace the default DNS server field with the static IP of your DNS host. Every client that renews its lease after that change will start using your resolver automatically. No touching individual machines, no static DNS entries on laptops you'll forget about. The one gotcha: clients mid-lease won't switch until renewal, so either wait or run sudo dhclient -r && sudo dhclient on Linux, or ipconfig /release && ipconfig /renew on Windows to force it.

Once a client has renewed, the first real test is dead simple:

```shell
# From any DHCP client on your LAN, not the DNS host itself
nslookup myservice.home.lab

# Expected output:
# Server:  192.168.1.53
# Address: 192.168.1.53#53
#
# Name:    myservice.home.lab
# Address: 192.168.1.100
```

If you see the right IP and the server field shows your DNS host rather than 8.8.8.8, the DHCP push worked. If it still shows 8.8.8.8, the lease hasn't renewed — force it and try again.

Split-horizon verification deserves its own explicit check because misconfigured forwarders are subtle. The goal is three separate behaviors working simultaneously: external names forward out correctly, internal names resolve locally, and your internal names don't leak onto public DNS. Test all three explicitly, not just the one that seems most obvious:

```shell
# External name via your resolver: should return real GitHub IPs
dig github.com @192.168.1.53

# Internal name via your resolver: should return your LAN A record
dig vault.home.lab @192.168.1.53

# Internal name against Google: should return NXDOMAIN, NOT your LAN IP
# If this returns an IP, you have a leak or a naming collision
dig vault.home.lab @8.8.8.8
```

The third query is the one people skip, and it's the important one. If vault.home.lab returns anything from 8.8.8.8 other than NXDOMAIN, either someone registered that domain publicly or your split-horizon config is wrong. A suffix reserved for private use, such as home.arpa (RFC 8375) or .internal (set aside by ICANN for private networks), avoids the collision risk entirely.
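The three checks are easy to automate once dig's output is reduced to a status code and an answer count. The sketch below is one way to do that in Python; the function names are hypothetical, the parsing assumes dig's default verbose output (the `status:` and `ANSWER:` fields in the header), and `run_dig` assumes dig is on the PATH.

```python
import re
import subprocess
from typing import List, Optional

STATUS_RE = re.compile(r"status:\s*([A-Z]+)")
ANSWER_RE = re.compile(r"ANSWER:\s*(\d+)")


def dig_status(output: str) -> Optional[str]:
    """Extract the response code (NOERROR, NXDOMAIN, ...) from full dig output."""
    match = STATUS_RE.search(output)
    return match.group(1) if match else None


def has_answer(output: str) -> bool:
    """True when the reply is NOERROR and carries at least one answer record."""
    count = ANSWER_RE.search(output)
    return dig_status(output) == "NOERROR" and count is not None and int(count.group(1)) > 0


def split_horizon_problems(external: str, internal: str, public_internal: str) -> List[str]:
    """Return human-readable problems; an empty list means all three checks passed."""
    problems = []
    if not has_answer(external):
        problems.append("external name did not resolve through the local resolver")
    if not has_answer(internal):
        problems.append("internal name did not resolve through the local resolver")
    if dig_status(public_internal) != "NXDOMAIN":
        problems.append("internal name is visible on public DNS (leak or collision)")
    return problems


def run_dig(name: str, server: str) -> str:
    """Run dig with a short timeout and return its raw output."""
    result = subprocess.run(
        ["dig", "+time=2", "+tries=1", name, f"@{server}"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

A caller would pass `run_dig("github.com", "192.168.1.53")`, `run_dig("vault.home.lab", "192.168.1.53")`, and `run_dig("vault.home.lab", "8.8.8.8")` as the three arguments and alert on a non-empty result.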

For ongoing health monitoring, a cron job beats installing another agent. One caveat about dig's exit code: it is non-zero only when no reply arrives at all (the daemon crashed, nothing is listening on the socket). A SERVFAIL or NXDOMAIN reply still exits 0, so the exit code alone will not notice a corrupted zone. Checking that the answer actually contains an address covers both cases. Cron also has no line-continuation syntax, so the entry has to stay on a single line. Drop this in your DNS host's crontab:

```shell
# /etc/cron.d/dns-health: runs every 5 minutes, mails root on failure
*/5 * * * * root dig +short +time=2 +tries=1 @127.0.0.1 myservice.home.lab A | grep -Eq '^[0-9.]+$' || echo "DNS health check failed at $(date)" | mail -s "DNS DOWN" root
```

If you're already running Uptime Kuma — and if you're doing any home lab monitoring it's worth having — add a DNS monitor pointing at your resolver's IP with the hostname vault.home.lab and expected response set to your LAN IP. Kuma will poll it on whatever interval you set and give you a dashboard entry alongside your HTTP checks. That's better than the cron job for visibility, but the cron job runs even if Kuma itself is down, so keep both.

Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.

Vaultwarden 1.36.0: What Changed, What to Patch, and How to Upgrade Without Losing Your Vault

우병수 — Thu, 25 Jun 2026 05:48:58 +0000

TL;DR: The self-hosted threat model inverts the SaaS assumption entirely. With Bitwarden's cloud offering, a dedicated security team monitors infrastructure, rotates credentials on breach indicators, and deploys patches before most users notice a CVE dropped.

📖 Reading time: ~19 min

What's in this article

Why This Update Isn't Optional
What 1.36.0 Actually Patched
Pre-Upgrade Checklist: Back Up Before You Touch Anything
The Actual Upgrade: Docker Compose Steps and Config Changes
Hardening the Instance Beyond the Patch
Monitoring After the Upgrade: Know When Something Goes Wrong
When to Delay an Upgrade (and When You Can't)

Why This Update Isn't Optional

The self-hosted threat model inverts the SaaS assumption entirely. With Bitwarden's cloud offering, a dedicated security team monitors infrastructure, rotates credentials on breach indicators, and deploys patches before most users notice a CVE dropped. With Vaultwarden, you are that team — one person, one instance, zero on-call rotation. Every credential you own, every API key you've vaulted, every SSH passphrase — all of it lives behind whatever you deployed and last touched six months ago.

Vaultwarden 1.36.0 closes specific attack surface in two areas that matter most: the admin panel authentication flow and session token handling. The admin panel in particular has always been an awkward piece of Vaultwarden's security posture — it's protected by a single shared token rather than a proper account, which means any weakness in how that token is validated or how sessions persist becomes an immediate privilege escalation vector. Skipping this update doesn't mean you're running a slightly older version; it means you're running something with a known, now-public vulnerability class that any scanner can probe for. The difference between "unpatched" and "targeted" shrinks the moment a CVE gets indexed.

The highest-risk configuration is also the most common one: Vaultwarden running in Docker, exposed on a public subdomain, sitting behind nginx or Caddy as a reverse proxy. If that describes your setup, the attack surface isn't theoretical. Your admin panel path (/admin) is reachable from the public internet unless you've explicitly blocked it at the proxy layer. The correct posture is to restrict it by source IP or require an additional auth layer — but regardless, you still need the patch. Proxy-layer restrictions reduce exposure; they don't substitute for fixing the underlying vulnerability. If you're on Caddy, something like this belongs in your config alongside the update:

@admin_block {
    path /admin*
    not remote_ip 192.168.1.0/24 10.0.0.0/8
}
respond @admin_block 403

Running Vaultwarden also means your vault stores more than personal passwords — it ends up holding automation credentials, webhook secrets, API tokens for n8n flows, and anything else you've decided a password manager is the right place for. That's a reasonable call, but it means a compromised instance doesn't just leak your email password; it leaks your entire operational secret graph. Managing that surface area — deciding what goes in the vault versus what lives in .env files or a secrets manager — is its own discipline. For broader context on wiring automation credentials across a home lab stack, the Workflow Automation in 2026: n8n, Zapier, and Self-Hosted Pipelines guide covers tooling decisions that intersect with this directly.

What 1.36.0 Actually Patched

The most operationally significant fix is the admin panel rate limiting — and the reason it took this long to land properly is instructive. Prior versions delegated rate limiting to whatever reverse proxy sat in front of Vaultwarden, which is fine until you consider that the admin endpoint reads X-Forwarded-For and similar headers to determine the "real" client IP. If your Nginx or Caddy config forwarded those headers without stripping attacker-controlled values, a brute-force attempt against /admin could cycle through IP representations and sidestep any proxy-level rate limit you'd configured. 1.36.0 moves a hard server-side rate limit into Vaultwarden itself, independent of what headers arrive. It applies regardless of proxy configuration. You should still rate-limit at the proxy layer, but you're no longer betting your admin panel on that config being correct.
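The header-spoofing problem is easier to see in code. The sketch below is not Vaultwarden's implementation; it is a generic illustration of why the leftmost X-Forwarded-For entry cannot be trusted and how walking the list from the right, skipping known proxies, yields an address the attacker cannot choose. The function name and the example proxy network are assumptions for illustration.

```python
import ipaddress
from typing import List


def client_ip(remote_addr: str, forwarded_for: str, trusted_proxies: List[str]) -> str:
    """Pick the client IP by walking X-Forwarded-For right to left, skipping trusted proxies."""
    trusted = [ipaddress.ip_network(net) for net in trusted_proxies]

    def is_trusted(addr: str) -> bool:
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False
        return any(ip in net for net in trusted)

    # If the direct peer is not one of our proxies, the header is attacker-controlled.
    if not is_trusted(remote_addr):
        return remote_addr

    hops = [hop.strip() for hop in forwarded_for.split(",") if hop.strip()]
    # Entries on the right were appended by our own proxies; anything further
    # left was supplied by the client and can be arbitrary.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return remote_addr
```

With a request that arrives as `X-Forwarded-For: 1.2.3.4, 203.0.113.9` from a trusted proxy, a naive leftmost read keys the rate limit on the forged 1.2.3.4, while this function returns 203.0.113.9, the address the proxy actually saw.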

The session token expiry fix is quieter but arguably more consequential for anyone running Vaultwarden exposed to the internet. The prior behavior had a validation gap where the server didn't consistently enforce token TTL on certain API call paths — a token grabbed from a compromised client could remain usable past the point where the user had changed their password or the session should have expired. The 1.36.0 fix tightens the freshness check on the server side so token expiry is evaluated on every protected API call, not just at login. If you're already rotating your API keys and using short session windows, the practical blast radius of the old behavior was small — but "small" is still non-zero on a credential vault.

The Rust dependency tree also got updated, with openssl and tower-http both bumped to pull in upstream CVE fixes. Neither was a Vaultwarden-specific exploit path: the tower-http issue affected denial-of-service behavior in certain HTTP/1.1 header handling scenarios, and the OpenSSL bump followed standard upstream advisory cadence. Still worth understanding the shape of the change. Vaultwarden's web server is built on Rocket rather than Axum, so check how tower-http enters the tree (cargo tree -i tower-http) before assuming it sits in the request path; any DoS vector that does reach the HTTP stack is a real availability concern for a single-instance self-hosted vault. You can inspect what actually changed in the crate tree yourself:

# from the Vaultwarden source root after pulling 1.36.0
cargo tree --duplicates
cargo audit
# cargo-audit will cross-reference your Cargo.lock against the RustSec advisory database
# install it once with: cargo install cargo-audit

Here's what 1.36.0 did not fix: ADMIN_TOKEN is still a plaintext environment variable by default. If your .env file or Docker Compose config has ADMIN_TOKEN=some_string, that string is sitting in process environment memory and in whatever secrets management (or lack thereof) you've wired up. Vaultwarden has supported Argon2-hashed tokens (PHC strings) for a while: generate one with echo -n "your_token" | argon2 "$(openssl rand -base64 16)" -e -id -t 3 -m 16 -p 4 -l 32, or more simply via the vaultwarden hash subcommand. The default experience hasn't changed, though. If you're treating 1.36.0 as an opportunity to audit your deployment, converting to a hashed token and injecting it from a secrets manager (Docker secrets, Vault, even just a chmod 600 env file mounted read-only) is a higher-leverage move than the patches themselves for most self-hosted setups.
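A quick audit step is to check whether the ADMIN_TOKEN in an env file is already an Argon2 PHC string or still plaintext. The following is a minimal sketch with hypothetical helper names; it only inspects the string's shape and does not verify the hash, and the sample values in the usage note are placeholders.

```python
import re
from typing import Optional

# PHC layout for Argon2: $argon2id$v=19$m=<KiB>,t=<iterations>,p=<lanes>$<salt>$<hash>
PHC_RE = re.compile(
    r"^\$(argon2id|argon2i|argon2d)\$v=(\d+)\$m=(\d+),t=(\d+),p=(\d+)"
    r"\$([A-Za-z0-9+/]+)\$([A-Za-z0-9+/]+)$"
)


def admin_token_from_env(env_text: str) -> Optional[str]:
    """Find ADMIN_TOKEN in .env-style text and return it without surrounding quotes."""
    for line in env_text.splitlines():
        line = line.strip()
        if line.startswith("ADMIN_TOKEN="):
            value = line.split("=", 1)[1].strip()
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
                value = value[1:-1]
            return value
    return None


def parse_admin_token(value: str) -> Optional[dict]:
    """Return the Argon2 parameters if value is a PHC string, else None (likely plaintext)."""
    match = PHC_RE.match(value.strip())
    if not match:
        return None
    variant, version, memory, iterations, lanes, _salt, _digest = match.groups()
    return {
        "variant": variant,
        "version": int(version),
        "memory_kib": int(memory),
        "iterations": int(iterations),
        "parallelism": int(lanes),
    }
```

Feeding it the contents of a .env file and getting `None` back from `parse_admin_token` is the signal that the token still needs converting.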

Pre-Upgrade Checklist: Back Up Before You Touch Anything

The backup step most people skip is the SQLite-level one, not the volume copy. Volume copies preserve file state, but if SQLite was mid-write when you ran cp -r, you've got a corrupt backup you won't discover until you need it. The safe sequence starts with SQLite's own backup mechanism inside the running container:

# Take a consistent snapshot inside the container
# (assumes the sqlite3 CLI is available there; the stock image may not include it)
docker exec vaultwarden sqlite3 /data/db.sqlite3 '.backup /data/db_backup.sqlite3'

# Then tar the whole data directory with a timestamp
docker run --rm \
  --volumes-from vaultwarden \
  -v $(pwd)/backups:/backups \
  alpine tar czf /backups/vaultwarden-$(date +%Y%m%d-%H%M%S).tar.gz /data

The .backup dot-command (a sqlite3 shell command, not a pragma) runs a hot backup through SQLite's online backup API: it handles locking and copies a consistent snapshot even while Vaultwarden is live. The volume tar that follows captures attachments, RSA keys, config files, and the icon cache. Both steps together take under ten seconds on a typical self-hosted instance, and skipping either one is how people end up with a vault full of encrypted blobs and no way to decrypt them.

Before you touch the image, verify the backup is actually readable — not just present on disk. Copy db_backup.sqlite3 somewhere outside the container and run:

sqlite3 db_backup.sqlite3 'PRAGMA integrity_check;'
# Expected output: ok

Anything other than a single ok means the file is damaged and you should re-run the backup before proceeding. If you have a second machine handy, opening the file there rules out filesystem-level corruption on your host. This takes 30 seconds and is the difference between a recoverable mistake and a complete vault loss.
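The backup-then-verify sequence can also be done from Python's standard library, which is handy when the sqlite3 CLI is not installed where the database lives. This is a sketch under that assumption: `backup_and_verify` is a hypothetical helper, and the paths are whatever your volume layout uses.

```python
import sqlite3


def backup_and_verify(source_path: str, backup_path: str) -> bool:
    """Copy the database with SQLite's online backup API, then integrity-check the copy."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        # Connection.backup() goes through SQLite's online backup API,
        # so it is safe to run while another process has the database open.
        source.backup(target)
    finally:
        source.close()
        target.close()

    check = sqlite3.connect(backup_path)
    try:
        rows = check.execute("PRAGMA integrity_check;").fetchall()
    finally:
        check.close()
    # A healthy database returns exactly one row containing the string "ok".
    return rows == [("ok",)]
```

Calling `backup_and_verify("/data/db.sqlite3", "/backups/db_backup.sqlite3")` returns `True` only when the copy exists and passes the same check as the command above.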

Pin down exactly what you're upgrading from before pulling 1.36.0. Vaultwarden has had migration issues when minor versions are skipped — the database schema migrations are sequential, and jumping too far has caused apply-order errors in past releases:

# Check the currently running image tag
docker inspect vaultwarden | grep -i image

# Or if you're using compose
docker compose ps --format json | jq '.[].Image'

Confirm you're on 1.35.x. If you're on something older, check the Vaultwarden GitHub releases page for any migration notes between your version and 1.36.0 before pulling. Separately, dump your current docker-compose.yml environment block — especially ADMIN_TOKEN, DOMAIN, and all SMTP_* variables — into a scratch file. These values survive the upgrade cleanly, but a compose rewrite during the update is a reliable way to accidentally clear a variable and spend an hour debugging why admin panel access is broken or outbound email has gone silent.
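The version check can be made mechanical. The sketch below assumes image tags of the form `1.35.2` or `1.36.0-alpine`; `parse_version` and `skipped_minor_versions` are hypothetical helpers, and a non-zero result is only a prompt to read the intermediate release notes, not proof that a migration will fail.

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, ...]:
    """Turn '1.35.2', 'v1.35.2', or '1.36.0-alpine' into a tuple of integers."""
    core = tag.lstrip("v").split("-")[0]
    return tuple(int(part) for part in core.split("."))


def skipped_minor_versions(current: str, target: str) -> int:
    """Count the minor releases jumped over when upgrading within one major version."""
    cur, tgt = parse_version(current), parse_version(target)
    if cur[0] != tgt[0]:
        raise ValueError("major version change: read the release notes first")
    return max(tgt[1] - cur[1] - 1, 0)


if __name__ == "__main__":
    # 1.35.x to 1.36.0 skips nothing; 1.33.0 to 1.36.0 skips 1.34 and 1.35.
    print(skipped_minor_versions("1.35.2", "1.36.0"))
    print(skipped_minor_versions("1.33.0", "1.36.0"))
```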

The Actual Upgrade: Docker Compose Steps and Config Changes

The pull-and-restart sequence matters more than most upgrade guides acknowledge. Running docker compose down before pulling creates a window where your Vaultwarden instance is completely unreachable — not just syncing slowly, but returning connection refused. Bitwarden clients handle this gracefully on the next sync cycle, but browser extensions in active use will throw errors until they retry. The cleaner approach:

# Pull the new image without stopping the running container
docker compose pull vaultwarden

# Recreate only the vaultwarden service; Compose stops the old container and starts the new one back to back
docker compose up -d vaultwarden

This keeps the outage window short rather than eliminating it. The old container keeps serving while the image downloads; Compose then stops it and starts the replacement, and clients that were mid-sync reconnect on their next cycle, usually without user-visible errors. On a typical VPS with a cached layer pull, the gap between old container stopping and new container accepting connections is under five seconds.

The one config change that will silently break rate limiting if you skip it: 1.36.0 made IP header handling explicit. If your reverse proxy (nginx, Caddy, Traefik — all common in front of Vaultwarden) passes X-Forwarded-For or X-Real-IP and you never set IP_HEADER in your environment, the rate limiter may be keying on the proxy's internal IP instead of the real client IP. That means every client looks like the same source, and you hit rate limits in unexpected ways. Add this to your .env or docker-compose.yml:

# In your .env file — match this to what your specific proxy actually sends
IP_HEADER=X-Forwarded-For
# OR if you're using nginx with proxy_set_header X-Real-IP $remote_addr:
# IP_HEADER=X-Real-IP

If you're not sure which header your proxy sends, check your nginx or Caddy config. Caddy sets X-Forwarded-For by default. Nginx only sets it if you've added proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for explicitly — many minimal configs omit this and use X-Real-IP instead.

Now that 1.36.0 actually enforces admin panel protections meaningfully, two hardening steps are worth doing on this upgrade pass rather than deferring. First, stop using a plaintext ADMIN_TOKEN. The official supported method since 1.28+ uses argon2:

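# Generate the hash; substitute your actual password for 'yourpassword'
# -id selects Argon2id (the CLI defaults to Argon2i); the salt is random rather than a fixed string
echo -n 'yourpassword' | argon2 "$(openssl rand -base64 32)" -id -t 3 -m 15 -p 4 -l 32 -e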

# Output looks like: $argon2id$v=19$m=32768,t=3,p=4$...
# Paste the full output as your ADMIN_TOKEN value in .env
# Wrap it in single quotes — the $ characters will break unquoted shell expansion
ADMIN_TOKEN='$argon2id$v=19$m=32768,t=3,p=4$$'

Second, restrict /admin at the reverse-proxy layer regardless of token strength. In nginx, this is a two-line addition to your server block:

location /admin {
    # Allow only your LAN CIDR — adjust to match your actual network
    allow 192.168.1.0/24;
    deny all;
    proxy_pass http://vaultwarden:80;
    # ...your existing proxy headers
}

After the upgrade, the smoke test takes about ninety seconds and catches the most common failure modes before your password manager goes dark mid-day. Log into the web vault directly, then hit the admin panel. Then:

# Clean startup shows version string and listening address — no ERROR lines
docker logs vaultwarden --tail 50 | grep -E "(ERROR|Vaultwarden|Listening)"

# Expected output on a healthy start:
# [INFO] Vaultwarden version 1.36.0 ...
# [INFO] Listening on 0.0.0.0:80

If you see ERROR lines around IP header parsing or rate limiter initialization, that's almost always the missing IP_HEADER env var. If the admin panel returns 403 immediately, double-check how the argon2 hash is quoted: in a .env file it needs single quotes, and inside docker-compose.yml each $ has to be written as $$. Otherwise Compose treats the dollar signs as variable references and silently mangles the hash before the container ever sees it.

Hardening the Instance Beyond the Patch

Before anything else: if SIGNUPS_ALLOWED is still set to true on a single-user instance, stop reading and fix that first. Every minute that flag is true, your instance is an open registration endpoint. Set it to false in your .env or Docker Compose environment block, restart the container, and verify it stuck by hitting /register in a browser — you should get a hard block, not a form.

The 1.36.0 rate limiter is internal to Vaultwarden, but internal rate limiting is your last line of defense, not your first. Push the enforcement upstream to your reverse proxy where connections can be dropped before they touch the app. The two endpoints worth locking down are /api/accounts/prelogin and the login endpoint itself, which Bitwarden-compatible clients reach at /identity/connect/token. Both are credential-stuffing targets because they're unauthenticated and return distinguishable responses. For Nginx:

# Define the zone once in http {} block
limit_req_zone $binary_remote_addr zone=vw_auth:10m rate=5r/m;

server {
    location /api/accounts/prelogin {
        limit_req zone=vw_auth burst=3 nodelay;
        proxy_pass http://127.0.0.1:8080;
    }

    # Password logins go through the identity endpoint; token refreshes use it too,
    # so loosen the rate if legitimate clients start seeing 429s
    location /identity/connect/token {
        limit_req zone=vw_auth burst=3 nodelay;
        proxy_pass http://127.0.0.1:8080;
    }
}

If you're on Caddy, the rate_limit directive from the caddy-ratelimit module covers the same surface. The directive isn't in the standard Caddy binary — you'll need a custom build or the xcaddy approach. The equivalent config limits to 5 requests per minute per IP on those two routes and rejects with a 429 before Caddy ever proxies the request. Either way, the proxy-level block means the connection dies at the edge; the Vaultwarden rate limiter becomes a redundant backstop rather than the primary gate.

Fail2ban can parse Vaultwarden's log output, but the Docker logging path needs explicit setup. Vaultwarden doesn't write to a file by default when running in Docker: its output goes to the container's stdout, and the json-file logging driver stores that under Docker's own data directory rather than at a path you choose. The simplest fix is Vaultwarden's LOG_FILE environment variable. Mount a host directory such as /var/log/vaultwarden into the container and point LOG_FILE at a file inside it, so the host-side path matches the jail's logpath. The filter pattern itself is straightforward:

# /etc/fail2ban/filter.d/vaultwarden.conf
[Definition]
# Fail2ban needs a host capture in the pattern to know which address to ban
failregex = ^.*Username or password is incorrect\. Try again\. IP: <ADDR>\. Username:.*$
ignoreregex =

# /etc/fail2ban/jail.d/vaultwarden.conf
[vaultwarden]
enabled  = true
port     = http,https
filter   = vaultwarden
logpath  = /var/log/vaultwarden/vaultwarden.log
maxretry = 5
findtime = 600
bantime  = 1800

The 5-attempt/10-minute threshold (findtime = 600) is conservative enough to catch automated stuffing without locking out a legitimate user who fat-fingered their master password twice. Adjust bantime upward if you're seeing repeat offenders — 30 minutes is a reasonable starting point, not a ceiling.
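To get a feel for how maxretry and findtime interact before touching the jail, the threshold can be modelled as a sliding window. This is a simplified sketch, not Fail2ban's actual code: `would_ban` is a hypothetical function, timestamps are plain seconds, and the exact boundary behavior at precisely findtime seconds is an assumption.

```python
from collections import deque
from typing import List


def would_ban(failure_times: List[float], maxretry: int = 5, findtime: int = 600) -> bool:
    """Return True if any findtime-second window holds at least maxretry failures."""
    window = deque()
    for ts in sorted(failure_times):
        window.append(ts)
        # Drop failures that have aged out of the look-back window.
        while ts - window[0] > findtime:
            window.popleft()
        if len(window) >= maxretry:
            return True
    return False


if __name__ == "__main__":
    # Two typos a minute apart: no ban.
    print(would_ban([0, 60]))
    # Five failures inside four minutes: ban.
    print(would_ban([0, 60, 120, 180, 240]))
```

Replaying timestamps pulled from a real log through this function shows how many legitimate users a tighter maxretry would have caught.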

On secrets management: if you're running an automation layer like n8n that reads Vaultwarden credentials or calls the admin API, the ADMIN_TOKEN hash sitting in a .env file that gets committed to a repo (even a private one) is a real exposure. Docker secrets solve this without requiring a full secrets manager. In Compose v3.7+:

# docker-compose.yml
services:
  vaultwarden:
    image: vaultwarden/server:1.36.0
    environment:
      # Reference the secret file path instead of embedding the value
      ADMIN_TOKEN_FILE: /run/secrets/vw_admin_token
    secrets:
      - vw_admin_token

secrets:
  vw_admin_token:
    file: ./secrets/vw_admin_token.txt  # lives outside the repo root

Vaultwarden 1.36.0 supports the _FILE suffix convention for reading secrets from mounted files; verify against the current docs since this behavior can shift between minor releases. The secrets/ directory goes in your .gitignore, and the value itself never touches the Compose file. If your n8n flows use the Vaultwarden admin API, give those flows their own scoped token if the admin API ever gains per-token scoping. For now, the Docker secret approach at least keeps the hash out of the repository and out of environment variable dumps; the secret file itself still sits on disk, so lock down its permissions.

Monitoring After the Upgrade: Know When Something Goes Wrong

The upgrade itself is five minutes of work. The part that actually protects you is what runs the other 99.9% of the time. Vaultwarden 1.36.0 added distinct log entries for rate-limit triggers — where earlier versions would silently drop requests or log a generic rejection, you now get a structured line you can grep for. That single change makes brute-force detection possible without a full SIEM. The minimum viable version: a cron job that runs every hour, counts occurrences of rate limit exceeded in the container log, and fires a webhook if the count crosses your threshold.

#!/bin/bash
# /opt/scripts/check_ratelimit.sh
# Counts rate-limit log entries from the last 60 minutes
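THRESHOLD=10
CONTAINER="vaultwarden"
LOGFILE="/var/log/vaultwarden_ratelimit_check.log"
# WEBHOOK_URL must come from the environment; abort early if it is missing
: "${WEBHOOK_URL:?WEBHOOK_URL is not set}"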

COUNT=$(docker logs --since=1h "$CONTAINER" 2>&1 | grep -c "rate limit exceeded")

if [ "$COUNT" -gt "$THRESHOLD" ]; then
  curl -s -X POST "$WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"Vaultwarden: $COUNT rate-limit events in the last hour\"}"
  echo "$(date) ALERT: $COUNT events" >> "$LOGFILE"
fi

Schedule that script hourly with an entry in /etc/cron.d/ and you have an effective brute-force early warning with zero extra infrastructure. If you're already running Grafana with Loki, point a log stream at the container and create an alert on the query {container="vaultwarden"} |= "rate limit exceeded". You'll get rate-over-time graphing as a bonus, which is genuinely useful for distinguishing a scanning bot from a misconfigured mobile client hammering the sync endpoint.

For uptime monitoring, two checks cover most failure modes. A TCP check on the port your reverse proxy exposes (80/443; older releases also used 3012 for WebSocket notifications) will catch container crashes within seconds. But a container can stay up and the app can wedge: the event loop stalls, DB connections exhaust, whatever. That's what /alive is for. Vaultwarden exposes this health endpoint, which returns a 200 when the app is actually responding. In Uptime Kuma, set up an HTTP monitor pointed at https://your-vault-domain/alive with a 60-second check interval and 200 as the accepted status code; the response body is a timestamp rather than a fixed string, so a keyword match on OK will not work. The TCP check catches hard crashes; the HTTP check catches soft failures the TCP check misses entirely.

Backup verification is where most self-hosters have a gap. Copying a SQLite file to S3 every night is not a backup strategy — it's a hope strategy. The actual test is whether that file can be opened and passes integrity checks. This is one place where my n8n setup earns its keep: a weekly workflow that pulls the latest backup, runs sqlite3 db.sqlite3 "PRAGMA integrity_check;", and posts the result to a webhook. If the output is anything other than ok, the flow sends an alert.

# Run from a cron or n8n Execute Command node
BACKUP_PATH="/mnt/backups/vaultwarden/db_$(date +%Y%m%d).sqlite3"

# -batch: non-interactive mode, -init /dev/null: skip .sqliterc
RESULT=$(sqlite3 -batch -init /dev/null "$BACKUP_PATH" "PRAGMA integrity_check;" 2>&1)

if [ "$RESULT" = "ok" ]; then
  echo '{"status":"ok","file":"'"$BACKUP_PATH"'"}'
else
  echo '{"status":"FAILED","output":"'"$RESULT"'"}'
fi

And when something does go wrong with the upgrade itself, the rollback is mechanical if you prepared correctly. The full sequence:

docker compose down — stops the container cleanly, doesn't touch the data volume
Edit docker-compose.yml: change vaultwarden/server:1.36.0 back to vaultwarden/server:1.35.0
Restore the pre-upgrade SQLite snapshot to the data volume: cp /mnt/backups/vaultwarden/pre_upgrade_db.sqlite3 /opt/vaultwarden/data/db.sqlite3
docker compose up -d

Clients reconnect and see their vault exactly as it was before the upgrade. The critical dependency is that backup being current — which is exactly why the integrity check and the pre-upgrade snapshot aren't optional steps you do when you remember. The rollback only works cleanly if the backup was taken after a clean shutdown and before the new container ever touched the database file. If you let 1.36.0 run migrations and then try to roll back to a stale backup, you will lose data written between the migration and the restore. Shut down, copy, then bring up the new version. That order is non-negotiable.

When to Delay an Upgrade (and When You Can't)

The 24-48 hour hold is a legitimate strategy for self-hosted software — but you have to actually use it, not just wait randomly. The Vaultwarden GitHub issues tab fills up fast after a release. Early adopters running unusual configs (SQLite on ARM, non-standard SMTP setups, Cloudflare Tunnel proxies) will surface startup panics or migration failures within hours. A silent database schema change that works fine on Postgres 16 can corrupt a SQLite WAL file in an edge case that nobody caught in testing. Waiting a day and skimming the open issues costs you nothing unless you're actively being attacked.

1.36.0 is one of the releases where the hold strategy breaks down for a specific subset of operators: anyone with the admin panel exposed directly to the internet without an additional auth layer in front of it. The rate-limiting fix addresses a real attack surface that exists right now. If your /admin path is reachable without IP allowlisting, a Cloudflare Access rule, or at minimum HTTP basic auth at the reverse proxy, you're not in a position to wait. The version of Vaultwarden running on that box is already a target. Update, then harden the ingress.

Version pinning by digest is how you get the best of both worlds — controlled rollout without riding latest into whatever the next release drops. Pull the digest after you've confirmed the release is stable:

# Get the digest for a specific tag after it's been out 48 hours
docker pull vaultwarden/server:1.36.0
docker inspect --format='{{index .RepoDigests 0}}' vaultwarden/server:1.36.0
# outputs something like:
# vaultwarden/server@sha256:a3f9c2e1...

# Use that in your compose file, not the tag
image: vaultwarden/server@sha256:a3f9c2e1d4b8f7e6c5d2a1b9e3f8c7d6e5b4a3f2e1d0c9b8a7f6e5d4c3b2a1f0

The tradeoff: pinning by digest means you will not accidentally move, which is good. It also means you have to actually watch the release feed — GitHub releases RSS, the Vaultwarden Discord, or the GitHub tags API — otherwise you'll sit on a vulnerable version indefinitely and feel safe because your compose file looks deliberate. Set a calendar reminder or wire up a GitHub Actions workflow that pings you when a new tag appears on the repo.
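A minimal sketch of the comparison such a reminder job would run. It assumes you already have the list of tag names (for example from the release feed) and that version tags follow plain MAJOR.MINOR.PATCH numbering. The function names and the sample tag list are illustrative, not part of any Vaultwarden tooling.

```python
def parse_version(tag: str) -> tuple[int, ...]:
    """Turn '1.36.0' or 'v1.36.0' into (1, 36, 0) for numeric comparison."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def newer_tags(pinned: str, available: list[str]) -> list[str]:
    """Return tags strictly newer than the pinned one, oldest first."""
    pinned_version = parse_version(pinned)
    candidates = []
    for tag in available:
        try:
            version = parse_version(tag)
        except ValueError:
            continue  # skip non-numeric tags such as 'latest' or 'testing'
        if version > pinned_version:
            candidates.append((version, tag))
    return [tag for _, tag in sorted(candidates)]


if __name__ == "__main__":
    # Hypothetical tag list; a real job would fetch this from the release feed.
    tags = ["latest", "1.35.2", "1.36.0", "1.36.1", "1.37.0"]
    print(newer_tags("1.36.0", tags))
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.9.0" sorts after "1.36.0", which would produce false alerts.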

The update cadence issue is subtler and most operators miss it entirely. Vaultwarden ships its own build of the Bitwarden web vault client alongside the server binary. A security-relevant fix in the upstream web vault JavaScript — a token handling bug, a CSP regression, a change in how the client validates server responses — can land in a Vaultwarden release without triggering a CVE assignment or a "security" label on the GitHub release. The release notes will mention the upstream web vault version bump, but you have to read them to catch it. Treating version tags as the only signal and skipping the actual changelog is how you miss half the security-relevant changes in any given release.

Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.

Monitoring a Kubernetes Cluster with OpenTelemetry Collector: Agent + Gateway Pattern That Actually Works

우병수 — Wed, 24 Jun 2026 07:44:10 +0000

TL;DR: The single-collector setup is almost always how this starts: one DaemonSet or Deployment running a monolithic OpenTelemetry Collector that scrapes node metrics, receives traces from instrumented apps, and ships everything to your backend. It works fine right up until it doesn't.

📖 Reading time: ~21 min

What's in this article

The Problem: One Collector Per Cluster Is a Reliability Trap
Architecture Before You Touch a YAML File
Deploying the Agent DaemonSet: Config and Gotchas
Deploying the Gateway: Batching, Retries, and Memory Limiter
Kubernetes-Specific Receivers Worth Enabling
Sizing, Resource Reality, and When This Pattern Breaks Down
Validating the Stack End-to-End Before You Trust It

The Problem: One Collector Per Cluster Is a Reliability Trap

The single-collector setup is almost always how this starts: one DaemonSet or Deployment running a monolithic OpenTelemetry Collector that scrapes node metrics, receives traces from instrumented apps, and ships everything to your backend. It works fine right up until it doesn't. One pipeline misconfiguration — say, a regex scrub rule that burns CPU on high-cardinality trace attributes — and the collector process OOMs. Every metric, every trace, every log from every node: gone simultaneously. There's no blast radius containment because there's no architectural separation.

The failure mode is especially sharp on self-hosted and home-lab clusters where you're not running multiple collector replicas behind a load balancer. A single pod with a memory limit of 512Mi hits a traffic spike during a deployment, the kernel OOM killer fires, and you lose observability exactly when you needed it most — during the rollout you were trying to watch. The monolithic collector is a single point of failure dressed up as a convenience.

The agent + gateway split fixes this structurally. Agents run as a DaemonSet — one collector pod per node — and their only jobs are local scraping (kubelet metrics, cAdvisor, node exporters) and short-term buffering. They're intentionally lightweight. The gateway runs as a separate Deployment (typically two replicas minimum) and handles everything expensive: batching, retry logic, tail sampling, and the actual export to Prometheus remote write, Grafana Cloud OTLP, or wherever your backend lives. A crashed agent takes out observability for one node. A crashed gateway replica doesn't take out collection — agents keep buffering locally until the gateway recovers.

# agent config: scrape local kubelet, forward to gateway only
exporters:
  otlp:
    endpoint: "otel-gateway.monitoring.svc.cluster.local:4317"
    tls:
      insecure: true  # plaintext gRPC inside the cluster; this hop has no TLS

receivers:
  kubeletstats:
    collection_interval: 15s
    auth_type: serviceAccount
    endpoint: "${env:K8S_NODE_NAME}:10250"
    insecure_skip_verify: true

service:
  pipelines:
    metrics:
      receivers: [kubeletstats]
      exporters: [otlp]  # never export to backend directly from agent

The concern separation also makes debugging tractable. If your Prometheus remote write endpoint starts rejecting samples, you debug the gateway config in one place instead of hunting through every node's collector logs. If one node's kubelet metrics are spiking cardinality, you adjust that agent's pipeline without touching anything else. The same separation applies if you're wiring observability signals into external workflows: the article "Workflow Automation in 2026: n8n, Zapier, and Self-Hosted Pipelines" covers how those handoff points work when cluster signals need to trigger downstream pipeline steps.

One gotcha that doesn't appear in the OpenTelemetry Collector docs until you've already been burned: agents buffering to disk (via the file_storage extension) need a PersistentVolume or at minimum a hostPath mount, otherwise a pod restart loses whatever was queued. Most example configs skip this entirely and leave you with in-memory queuing only — fine for traces with short TTLs, quietly catastrophic for metrics you're trying to preserve across a gateway outage.

Architecture Before You Touch a YAML File

The pattern that actually works at scale splits responsibility along a hard boundary: agents are dumb collectors with tight resource limits, the gateway is where all the expensive processing happens. Conflating these two roles into a single collector deployment is the most common mistake I see in OTel setups — you either end up with fat DaemonSet pods starving your workloads, or you lose the HA guarantees you need on the processing side.

The agent runs as a DaemonSet: one pod per node, no exceptions. Its job is narrow: collect host metrics via the hostmetrics receiver and kubelet stats via kubeletstats, tail container logs via the filelog receiver, and accept spans from app pods on the same node over OTLP gRPC port 4317. That last part is important. App pods can't reach the agent on localhost, because each pod has its own network namespace; instead, expose the agent's port 4317 as a hostPort and point each app at its node's IP (injected from status.hostIP through the downward API). The agent is guaranteed to be on the same node, which avoids cross-node hops for span traffic. Processing on the agent should be near-zero: maybe a memory_limiter and a basic resourcedetection processor for node-level labels, nothing else. A 128Mi memory limit is realistic for this role and you should enforce it. Agents that balloon past that are usually doing work they shouldn't be doing.

# agent memory_limiter — keep it lean
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 110        # hard limit, kept below the 128Mi pod limit
    spike_limit_mib: 20   # soft limit = 110 - 20 = 90 MiB; refusals start there

  resourcedetection:
    detectors: [env, k8snode]  # node name + any env vars you inject
    timeout: 5s
    override: false

The gateway is a regular Deployment — run at least two replicas for HA, with a HorizontalPodAutoscaler if your span volume is spiky. It receives OTLP from every agent over gRPC, and this is where the real pipeline work happens: k8sattributes processor to enrich with pod/namespace/deployment labels, a batch processor to reduce backend write pressure, and tail-based sampling if you're running traces at any meaningful volume. All your exporters — Prometheus remote write, Jaeger, Loki — live here, not on the agent. The gateway is the single egress point, which means you can change backends without redeploying anything on your nodes.

Traffic flow is strictly unidirectional: pod → agent (4317) → gateway (4317) → backend. That unidirectionality is load-bearing. The moment you introduce a feedback loop — say, a gateway that tries to scrape something that also pushes to the gateway — you get circular pipeline failures that are genuinely hard to debug because they only manifest under backpressure. The Prometheus Operator is the reason a lot of teams end up in this trap: they already have it running, it handles metrics fine, so they bolt OTel on top without rethinking the flow. The fundamental problem with stopping at the Prometheus Operator is that it doesn't handle traces or logs at all, and its scrape interval model (default 15s, minimum practical ~5s) is the wrong primitive for span data where you need sub-second decisions on sampling. The OTel Collector handles all three signals — metrics, logs, traces — in one binary with one config surface, which is why the agent/gateway split on top of it beats a hybrid Prometheus + something-else approach once you have more than one signal to care about.

Deploying the Agent DaemonSet: Config and Gotchas

The most expensive mistake you can make with the agent tier is under-provisioning memory on nodes that run GPU workloads. On those nodes, raise the 128Mi baseline: set resources.limits.memory: 256Mi and resources.requests.memory: 128Mi. GPU pods on the same node will spike memory pressure during model loading, and a collector agent with no headroom will get OOM-killed silently. You won't see a lasting crash loop because the kubelet restarts the container immediately, but you'll have gaps in your metrics that look like a scrape timing issue until you run kubectl describe pod on the agent and see OOMKilled as the reason for the last terminated state.

Install with the official Helm chart at version 0.97 or later — earlier versions had issues with the filelog receiver's multiline parsing defaults that caused partial log lines to flush incorrectly:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

helm upgrade --install otel-agent open-telemetry/opentelemetry-collector \
  --namespace monitoring \
  --create-namespace \
  --version 0.97.0 \
  --values agent-values.yaml

The agent-values.yaml core receiver block looks like this in practice. The kubeletstats receiver endpoint using the node name env var is not optional: with a hardcoded node IP, every agent pod in the DaemonSet scrapes the same kubelet and the other nodes go unmonitored:

mode: daemonset

resources:
  limits:
    memory: 256Mi
    cpu: 200m
  requests:
    memory: 128Mi
    cpu: 50m

config:
  receivers:
    hostmetrics:
      collection_interval: 30s
      scrapers:
        cpu: {}
        memory: {}
        network: {}
        filesystem:
          # exclude tmpfs and overlay to avoid noise from container layers
          exclude_mount_points:
            match_type: regexp
            mount_points: ["/dev.*", "/sys.*", "/proc.*", "overlay", "tmpfs"]

    kubeletstats:
      collection_interval: 30s
      auth_type: serviceAccount
      # requires K8S_NODE_NAME env var injected via fieldRef
      endpoint: "https://${env:K8S_NODE_NAME}:10250"
      insecure_skip_verify: true
      metric_groups:
        - node
        - pod
        - container

    filelog:
      include:
        - /var/log/pods/*/*/*.log
      include_file_path: true
      include_file_name: false
      operators:
        # parses Docker, containerd, and CRI-O log lines and reads pod name,
        # namespace, and container name from the file path; assumes a
        # collector build recent enough to ship the container operator
        - type: container

  exporters:
    otlp:
      endpoint: otel-gateway.monitoring.svc.cluster.local:4317
      compression: gzip
      sending_queue:
        enabled: true
        num_consumers: 4
        queue_size: 300
      retry_on_failure:
        enabled: true
        initial_interval: 5s
        max_interval: 30s

  service:
    pipelines:
      metrics:
        receivers: [hostmetrics, kubeletstats]
        exporters: [otlp]
      logs:
        receivers: [filelog]
        exporters: [otlp]

The kubeletstats 403 gotcha will burn you if you assume the Helm chart handles RBAC completely. Unless you enable the chart's kubeletMetrics preset, it creates a ServiceAccount and basic Role, but does not create a ClusterRole for nodes/stats and nodes/proxy. The failure mode is subtle: the collector starts successfully, the DaemonSet reports Ready, but kubeletstats silently drops metrics. You'll only see it if you're watching collector logs directly, where it surfaces as repeated failed to scrape: 403 Forbidden lines, not as a container that won't start. Create this manually:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-agent-kubeletstats
rules:
  - apiGroups: [""]
    resources:
      - nodes/stats
      - nodes/proxy
      - nodes/metrics
    verbs: ["get", "list", "watch"]
  # needed for pod and namespace metadata lookups (for example, by the k8sattributes processor)
  - apiGroups: [""]
    resources: [pods, namespaces]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-agent-kubeletstats
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-agent-kubeletstats
subjects:
  - kind: ServiceAccount
    name: otel-agent-opentelemetry-collector  # default name from Helm chart
    namespace: monitoring

The sending_queue size of 300 on the OTLP exporter is deliberate. Gateway restarts during upgrades or rollouts can take 15–30 seconds, and at 30-second collection intervals across a node with 40+ pods, you'll queue up a meaningful number of metric batches before the gateway is back. 300 entries covers that window without letting the agent balloon in memory — the queue is in-memory, so a 300-item queue on a busy node adds roughly 20–30MB depending on cardinality. If your nodes run dense, high-cardinality workloads, watch the agent's own memory usage via kubectl top pod in the monitoring namespace and tune the limit up before hitting the OOM wall.
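The sizing argument above reduces to two small calculations. The sketch below is a back-of-the-envelope helper, not collector code: the batch rate (5 per second) and average batch size (85 KiB) are assumed figures chosen for illustration, so substitute numbers measured on your own nodes.

```python
def outage_coverage_seconds(queue_size: int, batches_per_second: float) -> float:
    """Seconds of gateway downtime the queue can absorb before it starts dropping."""
    if batches_per_second <= 0:
        raise ValueError("batches_per_second must be positive")
    return queue_size / batches_per_second


def queue_memory_mib(queue_size: int, avg_batch_kib: float) -> float:
    """Worst-case memory held by a full in-memory queue, in MiB."""
    return queue_size * avg_batch_kib / 1024


if __name__ == "__main__":
    # Assumed workload: 5 batches/s leaving the agent, 85 KiB per batch.
    print(f"coverage: {outage_coverage_seconds(300, 5):.0f} s")
    print(f"full queue: {queue_memory_mib(300, 85):.1f} MiB")
```

If the coverage figure comes out shorter than your typical gateway rollout, raise queue_size and re-check the memory figure against the agent's limit before deploying.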

Deploying the Gateway: Batching, Retries, and Memory Limiter

The most counterintuitive thing about the gateway's memory_limiter processor is that its position is non-negotiable: it must be first in every pipeline, before batching, before sampling, before anything else. If you put it second, the batch processor has already allocated memory for a full buffer before the limiter gets a chance to refuse incoming data. By the time the limiter fires, you're already in an OOM spiral on a pod with a 512Mi limit. The processor ordering in OTel Collector config isn't cosmetic; data flows through the processors in the order the array lists them.

Here's the Helm values block I use for the gateway deployment:

# values-gateway.yaml
mode: deployment
replicaCount: 2

resources:
  limits:
    memory: 512Mi
    cpu: "1"
  requests:
    memory: 256Mi
    cpu: "250m"

config:
  processors:
    memory_limiter:
      limit_mib: 400
      spike_limit_mib: 80
      check_interval: 5s

    batch:
      send_batch_size: 8192
      send_batch_max_size: 10000
      timeout: 10s

  service:
    pipelines:
      metrics:
        processors: [memory_limiter, batch]  # order matters
      traces:
        processors: [memory_limiter, tail_sampling, batch]

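The limiter works with two thresholds. limit_mib: 400 is the hard limit, and the soft limit is limit_mib minus spike_limit_mib, so 320Mi here. Once usage crosses the soft limit the limiter starts refusing incoming data; at the hard limit it also forces garbage collection. The 112Mi gap between limit_mib and the 512Mi container limit is the cushion that keeps a sudden burst of spans from pushing the process over the cgroup limit before the next check fires. With check_interval: 5s, the limiter polls every five seconds, which is coarse but cheap. The memory_limiter documentation recommends 1s, which reacts faster to bursts at the cost of more frequent checks; if you stay at 5s with spiky traffic, increase spike_limit_mib to compensate.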
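A quick way to sanity-check limiter settings against a pod limit before deploying. This is an illustrative helper, not part of the collector; it relies only on the documented relationship that the soft limit equals limit_mib minus spike_limit_mib. The class and property names are made up for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LimiterBudget:
    container_limit_mib: int
    limit_mib: int
    spike_limit_mib: int

    @property
    def soft_limit_mib(self) -> int:
        """Usage above this makes the limiter start refusing new data."""
        return self.limit_mib - self.spike_limit_mib

    @property
    def headroom_mib(self) -> int:
        """Gap between the limiter's hard limit and the cgroup limit."""
        return self.container_limit_mib - self.limit_mib

    def validate(self) -> None:
        if self.spike_limit_mib >= self.limit_mib:
            raise ValueError("spike_limit_mib must be smaller than limit_mib")
        if self.headroom_mib <= 0:
            raise ValueError("limit_mib must sit below the container memory limit")


if __name__ == "__main__":
    for name, budget in {
        "gateway": LimiterBudget(512, 400, 80),
        "agent": LimiterBudget(128, 110, 20),
    }.items():
        budget.validate()
        print(name, "soft:", budget.soft_limit_mib, "headroom:", budget.headroom_mib)
```

Running validate() in CI against your values files catches the common mistake of raising limit_mib without also raising the pod's memory limit.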

On batching: the collector's default send_batch_size is 8192 items per the current docs, but depending on the chart version you're pulling, it may default to something far smaller — I've seen deployments shipping 200-item batches to a Prometheus remote write endpoint, which means hundreds of HTTP POSTs per minute for any reasonably active cluster. That hammers Mimir's ingester with tiny requests and inflates your HTTP overhead ratio badly. The config above sets send_batch_max_size: 10000 as a ceiling so a single goroutine flush can't send an unbounded payload if a queue backs up, and timeout: 10s forces a flush even if the batch isn't full — useful during low-traffic windows so metrics don't sit in the buffer stale.
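To put numbers on the request overhead, here is a rough estimate of export requests per minute for a given batch size. It assumes batches fill before the timeout fires, and the 5,000 points per second throughput is an invented example figure, not a measurement.

```python
import math


def posts_per_minute(points_per_second: float, send_batch_size: int) -> int:
    """Approximate export requests per minute when batches fill before the timeout."""
    if send_batch_size <= 0:
        raise ValueError("send_batch_size must be positive")
    return math.ceil(points_per_second * 60 / send_batch_size)


if __name__ == "__main__":
    throughput = 5000  # assumed metric points per second reaching the gateway
    for batch_size in (200, 8192):
        print(batch_size, "->", posts_per_minute(throughput, batch_size), "requests/min")
```

At low traffic the timeout dominates instead: with timeout: 10s the exporter flushes at least six times a minute no matter how small the batches are.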

For Prometheus remote write to a self-hosted Mimir instance, the exporter config looks like this:

exporters:
  prometheusremotewrite:
    endpoint: http://mimir.monitoring.svc:9009/api/v1/push
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    # tls is optional if you're terminating inside the cluster mesh
    tls:
      insecure: true

The retry_on_failure block is the part most people skip on first deploy and then regret. Without it, a Mimir pod restart or a brief network hiccup between namespaces silently drops whatever was in the exporter's send queue. With max_elapsed_time: 300s, the collector will retry for up to five minutes before giving up on a batch — which covers most rolling restarts cleanly. Note that retried data is held in memory, so size your limit_mib with that in mind if your backend goes down for extended periods.

Tail-based sampling is where the gateway topology gets sticky. The tail_sampling processor needs to see all spans from a single trace before it can make a keep/drop decision — so if trace abc123 has spans landing on both gateway pod replicas, neither pod has enough information to evaluate the policy. The standard approach is sticky routing from the agent side using the loadbalancing exporter, which hashes on trace ID:

# agent config — routes traces to gateway by trace ID hash
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway.monitoring.svc.cluster.local
        port: 4317
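The idea behind trace-ID routing can be sketched in a few lines. This is a simplified hash-and-modulo scheme for illustration only; the loadbalancing exporter uses its own hashing internally, and the replica addresses below are hypothetical.

```python
import hashlib


def pick_replica(trace_id_hex: str, replicas: list[str]) -> str:
    """Map a trace ID to one replica so all of its spans land together."""
    if not replicas:
        raise ValueError("at least one replica is required")
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    # Sorting makes the choice independent of the order DNS returned the IPs in.
    return sorted(replicas)[index]


if __name__ == "__main__":
    gateways = ["10.42.0.11:4317", "10.42.1.7:4317"]  # hypothetical pod IPs
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(pick_replica(trace_id, gateways))
```

Because the choice depends only on the trace ID and the replica set, two agents on different nodes make the same decision without coordinating. The weakness of plain modulo is that changing the replica count remaps most traces, which is why traces in flight during a gateway scale event can still be split.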

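With the loadbalancing exporter in place, every span for a given trace ID routes to the same gateway replica regardless of which node the agent pod lives on. One prerequisite: the DNS resolver has to see individual pod IPs, so the hostname must point at a headless Service (clusterIP: None); a regular ClusterIP Service resolves to a single virtual IP and the hashing has nothing to choose between. Then on the gateway, the tail sampling policy can do the useful thing: keep 100% of traces that contain at least one error span, and sample down to 5% of clean traces: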

processors:
  tail_sampling:
    decision_wait: 10s   # wait up to 10s for all spans before deciding
    num_traces: 50000    # in-memory trace buffer; tune to your trace volume
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: sample-healthy
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

The decision_wait: 10s means the processor holds spans in memory for up to 10 seconds before committing a sampling decision. Combined with num_traces: 50000, you can do the math on how much memory this buffer consumes relative to your average spans-per-trace count. High-cardinality microservice traces with 200+ spans per trace will eat that budget much faster than a simple 3-service path. If you see the gateway's heap climbing toward the limit_mib threshold under load, reducing num_traces or tightening decision_wait is the first dial to turn.
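The math in question, as a rough estimator. The 1 KiB per span figure is an assumption for illustration; real span size depends on attribute count and payloads, so measure before relying on it.

```python
def tail_buffer_mib(num_traces: int, spans_per_trace: float, bytes_per_span: int) -> float:
    """Worst-case memory if the trace buffer is full, in MiB."""
    return num_traces * spans_per_trace * bytes_per_span / (1024 * 1024)


def traces_in_flight(traces_per_second: float, decision_wait_s: float) -> float:
    """Traces held at once while the sampler waits for late spans."""
    return traces_per_second * decision_wait_s


if __name__ == "__main__":
    assumed_span_bytes = 1024  # assumption: about 1 KiB per span in memory
    for spans in (3, 200):
        size = tail_buffer_mib(50_000, spans, assumed_span_bytes)
        print(f"{spans} spans/trace -> {size:.0f} MiB with a full buffer")
    print("in flight at 500 traces/s:", traces_in_flight(500, 10))
```

Under these assumptions a full buffer of 3-span traces fits comfortably below a 400 MiB limiter, while 200-span traces would need several gigabytes. If traces_in_flight is far below num_traces, the buffer never fills and decision_wait is the dial that matters.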

Kubernetes-Specific Receivers Worth Enabling

The receiver placement question trips up almost every first-time OTel deployment on Kubernetes: the k8s_cluster receiver belongs on the gateway side, never on the agents. It talks directly to the Kubernetes API server to pull pod phase transitions, node conditions, and deployment replica counts. That's cluster-wide state, not node-local telemetry. If you accidentally run it on every agent DaemonSet pod (easy to do when you're copying config between components), you'll end up with the same metric emitted from a dozen sources simultaneously, each with a slightly different collection timestamp. Prometheus or your OTLP backend will either deduplicate them incorrectly or surface false gaps. Run exactly one instance. With two gateway replicas that means enabling the receiver on only one collector, for example a dedicated single-replica Deployment, because every replica that loads the receiver emits its own copy. Give that pod a service account with get, list, and watch on at least pods, nodes, namespaces, and replicasets.

receivers:
  k8s_cluster:
    collection_interval: 30s
    node_conditions_to_report:
      - Ready
      - MemoryPressure
      - DiskPressure
    allocatable_types_to_report:
      - cpu
      - memory
    # auth_type defaults to serviceAccount — works inside the cluster
    # Do NOT add this block to your agent config

The k8sevents receiver is the sleeper feature most people ignore until they've spent an hour grepping through logs for why a pod won't start. It surfaces Kubernetes events (OOMKills, BackOff loops, failed scheduling decisions, image pull failures) as structured log records with resource attributes. Route these into Loki with a pipeline that tags them by k8s.namespace.name and k8s.pod.name, and suddenly you can run a single LogQL query that shows both the application's stdout and the Kubernetes event that preceded the crash. The key config detail is the auth_type: serviceAccount plus a ClusterRole that grants get, list, and watch on the events resource; the receiver watches the API rather than polling it. Without the resource attribute filter on the Loki exporter side, events from all namespaces land in the same stream and become useless noise.

receivers:
  k8sevents:
    auth_type: serviceAccount

exporters:
  loki:
    endpoint: http://loki-gateway:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true
    # resource attributes promoted to Loki stream labels:
    resource_to_labels_mappings:
      k8s.namespace.name: namespace
      k8s.object.kind: kind
      k8s.object.name: object_name

service:
  pipelines:
    logs/k8sevents:
      receivers: [k8sevents]
      processors: [memory_limiter, batch]
      exporters: [loki]

The resourcedetection processor with detectors: [k8snode, env] is the piece that makes cross-signal correlation actually work in Grafana. Without it, a metric arrives tagged with a pod name but nothing that tells you which node it ran on or which cloud region that node lives in. The k8snode detector reads the node name from the downward API (you expose it as K8S_NODE_NAME env var on the agent pod), then enriches every span, metric, and log with k8s.node.name. The env detector picks up whatever you've set in OTEL_RESOURCE_ATTRIBUTES, which is where you inject cluster name since there's no native detector for it yet. Drop this processor early in every agent pipeline. On the gateway, the k8snode detector would describe the gateway's own node rather than the node that produced the telemetry, so leave node detection to the agents. Skip it and you'll find metrics that correlate fine within a single node but fall apart when you try to build a node-comparison dashboard.

processors:
  resourcedetection:
    detectors: [k8snode, env]
    k8snode:
      # reads K8S_NODE_NAME, injected through the downward API (see the env block below)
      node_from_env_var: K8S_NODE_NAME
    timeout: 2s
    override: false  # don't overwrite attributes set by the app itself

# In your agent DaemonSet spec:
# env:
#   - name: K8S_NODE_NAME
#     valueFrom:
#       fieldRef:
#         fieldPath: spec.nodeName
#   - name: OTEL_RESOURCE_ATTRIBUTES
#     value: "k8s.cluster.name=prod-us-east-1"

One gotcha with the k8s_cluster receiver: it emits metrics under the k8s. prefix, but the exact metric names shifted between OTel Collector versions around 0.80. If you're referencing dashboards built against an older collector, you may find deployment replica metrics named k8s.deployment.desired vs k8s.deployment.desired_pods depending on what version generated them. Pin your collector image tag explicitly and validate metric names with a one-off run through the debug exporter (verbosity: detailed) before wiring up Grafana panels. It saves a lot of "why is this panel empty" time.

Sizing, Resource Reality, and When This Pattern Breaks Down

The numbers here are smaller than most blog posts admit. On a three-node home-lab cluster (8 cores, 32GB RAM per node), the agent DaemonSet pods sit at roughly 50–80Mi RSS and 0.05–0.1 CPU cores at steady state with the k8sattributes processor and the hostmetrics and kubeletstats receivers active. The gateway Deployment pair (two replicas) lands at 150–300Mi RAM depending on trace and log volume. Those numbers sound fine until you enable the filelog receiver without a max_log_size cap on a node running chatty Java services: memory climbs past 600Mi per agent within hours as the receiver buffers partial reads. Always set this:

receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]
    operators:
      - type: recombine
        combine_field: body   # required: the field whose entries get joined
        is_last_entry: 'body matches "\\n$"'
    # Without this, a single runaway container will OOM your agent
    max_log_size: 1MiB

The gateway pattern solves the wrong problem if your observability backend can't absorb what the gateway is forwarding. A self-hosted Prometheus with a 15-second scrape interval and no remote write configured is almost always the actual bottleneck — not the collector. Once you're retaining more than 30 days of metrics, the local Prometheus TSDB becomes a liability: compaction stalls, WAL replay on restart takes minutes, and head block size blows up. Mimir (single-binary mode) or Thanos sidecar are the practical exits. Mimir's single-binary mode is underrated for home-lab — one pod, object storage backend, and you get multi-tenancy and long-term retention without the Thanos component sprawl. The gateway's prometheusremotewrite exporter points at Mimir and you're done:

exporters:
  prometheusremotewrite:
    endpoint: "http://mimir:9009/api/v1/push"
    headers:
      X-Scope-OrgID: homelab
    tls:
      insecure: true  # internal cluster, no cert needed
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 120s

Single-node k3s or a local kind cluster doesn't need this pattern. Two Helm releases, separate ConfigMaps, a Service for gateway ingestion — that's real operational overhead when you're the only person maintaining it and the cluster has four pods running. A single collector Deployment with hostmetrics, k8s_cluster, and kubeletstats receivers combined is simpler and covers every signal you'll realistically query. The agent + gateway split pays for itself when you have three or more nodes, multiple teams writing to the cluster, or a backend that needs traffic shaping before it receives data. Before that threshold, it's complexity theater.

The debug workflow that catches misconfiguration before it silently drops your data: wire a debug exporter with verbosity: detailed into a parallel pipeline that mirrors your production receivers, then check the zpages endpoint. Your production pipelines stay untouched; the only config change is the extra pipeline and the extension:

exporters:
  debug:
    verbosity: detailed        # logs full telemetry to collector stdout
    sampling_initial: 5        # log the first 5 messages each second
    sampling_thereafter: 20    # after that, log every 20th message

service:
  extensions: [zpages]
  pipelines:
    metrics/debug:
      receivers: [otlp]        # same receiver the production pipeline uses
      processors: [memory_limiter]
      exporters: [debug]

extensions:
  zpages:
    endpoint: 0.0.0.0:55679   # port-forward this to localhost

Then from your workstation:

kubectl port-forward -n monitoring svc/otel-gateway 55679:55679
# Open: http://localhost:55679/debug/pipelinez

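The /debug/pipelinez page shows which receivers, processors, and exporters each pipeline actually loaded, which catches a component that was defined but never wired in. The accepted, refused, and failed counts live in the collector's own metrics on port 8888 (otelcol_receiver_accepted_*, otelcol_receiver_refused_*, otelcol_exporter_send_failed_*). Refused means a component in the pipeline pushed back on the receiver (often a memory limiter threshold hit), so the sender is expected to retry. Send-failed means the exporter could not deliver to the backend, and once the retry budget ran out the data was dropped. That distinction matters: if you're seeing send failures and no refusals, the problem is downstream, not in your pipeline config.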

Validating the Stack End-to-End Before You Trust It

The most common mistake after wiring up an agent/gateway pipeline is trusting the absence of errors as proof it works. Collector logs staying quiet and Prometheus targets showing green means nothing if your tail sampler is silently dropping everything or the k8sattributes processor can't call the API server. Verify the full signal path before you put any workload under this stack.

Firing a Synthetic Trace Span

otel-cli is a statically compiled binary that speaks the OTLP gRPC protocol directly — no SDK, no app code, no sidecars. Install it on any jump box or run it inside the cluster as a one-shot pod, then fire a span at the agent's ClusterIP:

# Run from inside the cluster — replace agent-svc with your actual Service name
otel-cli span \
  --name test-span \
  --service my-app \
  --endpoint http://agent-svc:4317 \
  --tp-required=false \
  --verbose

The --verbose flag prints the gRPC status response so you can distinguish "agent accepted" from "connection refused" immediately. If the span doesn't appear in Jaeger or Tempo after about 10 seconds, resist the urge to blame the agent first. Check the gateway's tail sampler policy. A tail sampler with a probabilistic policy at sampling_percentage: 10 will silently discard 90% of traffic, including your synthetic span if it lands in a sampled-out trace ID bucket. Temporarily set sampling_percentage: 100, refire the span, then tune back down. If it still doesn't appear, check whether the gateway's OTLP exporter is targeting the right backend address. A wrong hostname won't stop the collector from starting; it shows up as export errors in the gateway logs and as a climbing otelcol_exporter_send_failed_spans counter.

Metric Pipeline Integrity via Internal Telemetry

Both the agent and the gateway expose their own Prometheus metrics on port 8888 by default. The two counters worth watching are otelcol_receiver_accepted_metric_points and otelcol_exporter_sent_metric_points, scoped per receiver/exporter label. A healthy pipeline has these tracking each other closely. When accepted exceeds sent for longer than a rolling 60-second window, data is being dropped inside the collector — not lost on the wire, but discarded after ingestion.

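# Prometheus query to catch sustained drop gaps on the gateway.
# sum() strips the receiver/exporter labels; without it the two sides
# share no matching label set and the subtraction returns nothing.
(
  sum(rate(otelcol_receiver_accepted_metric_points{job="otelcol-gateway"}[2m]))
  -
  sum(rate(otelcol_exporter_sent_metric_points{job="otelcol-gateway"}[2m]))
) > 0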

A non-zero result here almost always points to the memory limiter processor. The default memory_limiter config in most example pipelines sets limit_mib: 512, which is fine for small clusters, but under real scrape load with hundreds of pods the gateway resident set climbs fast. Either raise limit_mib toward the container's actual memory limit (leave about 20% headroom), or lengthen the agent's collection interval from 15s to 30s if you have flexibility there. The other thing to check is otelcol_exporter_send_failed_metric_points: that counter increments when the downstream backend (Prometheus remote write, Cortex, Mimir) is rejecting writes, which looks identical to a memory limiter problem in dashboards unless you separate the two counters.
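For a one-off check without Prometheus, the same comparison can be run against a copy of the collector's metrics page. The sketch below parses Prometheus text exposition and sums each counter across label sets. It is a simplified parser (it assumes label values contain no closing brace), and matching on a name prefix is a deliberate choice so that builds which append a _total suffix are still counted. The sample text in the demo is invented.

```python
import re

SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def sum_metric(exposition_text: str, name_prefix: str) -> float:
    """Sum every sample whose metric name starts with name_prefix."""
    total = 0.0
    for line in exposition_text.splitlines():
        if line.startswith("#"):
            continue  # HELP and TYPE comments
        match = SAMPLE_LINE.match(line)
        if match and match.group(1).startswith(name_prefix):
            total += float(match.group(2))
    return total


def pipeline_gap(exposition_text: str) -> dict[str, float]:
    """Accepted vs sent vs failed metric points since the collector started."""
    accepted = sum_metric(exposition_text, "otelcol_receiver_accepted_metric_points")
    sent = sum_metric(exposition_text, "otelcol_exporter_sent_metric_points")
    failed = sum_metric(exposition_text, "otelcol_exporter_send_failed_metric_points")
    return {"accepted": accepted, "sent": sent, "send_failed": failed,
            "unaccounted": accepted - sent - failed}


if __name__ == "__main__":
    sample = """\
# TYPE otelcol_receiver_accepted_metric_points counter
otelcol_receiver_accepted_metric_points{receiver="otlp",transport="grpc"} 12000
otelcol_exporter_sent_metric_points{exporter="prometheusremotewrite"} 11700
otelcol_exporter_send_failed_metric_points{exporter="prometheusremotewrite"} 150
"""
    print(pipeline_gap(sample))
```

These are cumulative counters, so a small unaccounted value is normal while data sits in the batch processor or the send queue. A value that keeps growing between two snapshots is the signal worth chasing.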

Log Pipeline Label Smoke Test

The log path has a failure mode that produces no errors anywhere: the k8sattributes processor runs, calls the API server to enrich log records, fails due to an RBAC gap, and then passes the log through without the Kubernetes metadata labels rather than dropping it or logging a warning. Logs arrive in Loki, they look fine, but every query that filters on a pod or namespace label returns nothing. Loki label names cannot contain dots, so an attribute such as k8s.pod.name appears as k8s_pod_name.

# Write a known log line to the container's stdout (PID 1) so it lands in the pod log.
# A plain echo under kubectl exec only prints to your own terminal.
kubectl exec -n your-ns deploy/any-deployment -- \
  sh -c 'echo "otel-label-test-$(date +%s)" > /proc/1/fd/1'

# Then query Loki (requires logcli or the Loki HTTP API).
# k8s_pod_name is an assumed label name; use whatever your label mapping produces.
logcli query '{namespace="your-ns"}' --limit=5 --output=jsonl | \
  jq '.labels | has("k8s_pod_name")'

If the has() check returns false, the processor isn't attaching metadata. The most common cause is a missing ClusterRole binding. The k8sattributes processor needs get, watch, and list on pods, namespaces, nodes, and replicasets — that last one catches people because it's needed for resolving owner references back to Deployments. A quick way to confirm it's an RBAC problem rather than a misconfigured processor is to check the collector pod logs for lines containing k8sattributes and Forbidden; those only appear at startup when the processor does its initial list call. If the collector started cleanly but labels are still missing, the issue is more likely the extract.metadata block in your processor config not listing k8s.pod.name explicitly.

Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.

FileRun, OnlyOffice, and Pangolin for a Self-Hosted Web Calendar: Which One Actually Fits Your Stack

우병수 — Mon, 22 Jun 2026 07:49:05 +0000

TL;DR: The setup sounds simple until you actually try to build it: a calendar you can open in a browser, sync to your phone over CalDAV, host on your own hardware, and never pay a per-seat bill for. That constraint eliminates almost every polished option immediately.

📖 Reading time: ~23 min

What's in this article

The Problem: You Want a Web Calendar Without Renting Someone Else's Server
The Constraint That Forces a Choice
FileRun 2026: Calendar as a Bolt-On to a File Manager
OnlyOffice: Calendar Inside a Document Collaboration Suite
Pangolin: The Reverse-Proxy-Native Approach
Side-by-Side: What You Actually Get Per Use Case
Gotchas That Cost Time Across All Three

The Problem: You Want a Web Calendar Without Renting Someone Else's Server

The setup sounds simple until you actually try to build it: a calendar you can open in a browser, sync to your phone over CalDAV, host on your own hardware, and never pay a per-seat bill for. That constraint eliminates almost every polished option immediately. What's left is a pile of self-hosted tools that technically touch calendars but were mostly built to solve adjacent problems — and the documentation rarely admits that.

The usual suspects disappoint in predictable ways. Nextcloud does have a working CalDAV implementation and a decent browser UI, but you're pulling in a full PHP application stack, a mandatory database, and a sprawling plugin ecosystem just to get a calendar. On a small VPS or a home server with limited RAM, Nextcloud idles at several hundred megabytes before anyone logs in. Radicale is the opposite extreme — a tight Python CalDAV/CardDAV server that runs in about 20MB, has zero browser UI, and requires you to already know what a .ics file is to do anything with it. The guides that recommend Radicale almost always pair it with a third-party web frontend that's either abandoned or requires a separate Node or PHP runtime, which defeats the point. And most tutorials conflate file sync with calendar functionality entirely — they'll walk you through mounting a WebDAV share and call it done.

FileRun 2026, OnlyOffice, and Pangolin approach the problem from three genuinely different angles, which is why comparing them forces a real decision rather than a preference. FileRun is a file manager that also exposes CalDAV and CardDAV endpoints; the calendar is a secondary surface, and the browser UI for it is minimal. OnlyOffice has a polished calendar, but it lives in Community Server rather than in the standalone Docs editor most tutorials deploy, so calendar is not its primary job and the deployment complexity reflects that. Pangolin ships no calendar at all: it is a reverse proxy and identity layer that you put in front of a lightweight CalDAV server, which means access control sits in front of the calendar rather than being bolted onto it. Each of those architectural choices has downstream consequences: what breaks when you upgrade, how CalDAV sync actually behaves with iOS and Android clients, and how much Docker Compose YAML you're maintaining on a Sunday afternoon.

The trade-off is real because none of these are drop-in. OnlyOffice gives you the most polished browser experience, but Community Server bundles its own database, search, and message queue and wants several gigabytes of RAM before it feels stable. FileRun is lighter but still needs a cooperating database container, and its calendar UI is thin. Pangolin needs a separate CalDAV backend and adds a second moving part, but the identity and routing layer it ships with is genuinely useful if you're running multiple self-hosted services behind the same domain. If you're already evaluating what tooling sits around your self-hosted stack, including AI-assisted development tools, the tradeoffs around local vs. cloud-hosted capability are worth thinking through in parallel; see our guide on AI Coding Tools in 2026: Cloud Copilots vs Local Models for how that decision tree plays out.

One thing that doesn't get said enough in self-hosting calendar guides: CalDAV compliance is not binary. A server can pass basic sync and still break iOS's birthday calendar sync, still corrupt recurring event exceptions on Android, or still refuse to serve free/busy data to a second client. The CalDAV implementation underneath, whether that's the one in OnlyOffice Community Server, FileRun's DAV endpoint, or the backend you put behind Pangolin, determines which of those edge cases you'll hit. That's the part worth testing before you migrate anything important off Google Calendar or iCloud.

The Constraint That Forces a Choice

The Hardware Baseline and Why It Forces Real Decisions

The comparison here isn't theoretical. Single-node Docker host, 16 GB RAM, no GPU, Caddy handling TLS termination and reverse proxying. That constraint eliminates a whole class of recommendations immediately — anything that idles at 2+ GB RAM per service is already competing for headroom with databases, reverse proxies, and whatever else is running on the same box. Calendar workloads have no GPU requirement, but they do have a latency expectation: a user hitting a CalDAV sync or loading a shared calendar invite shouldn't wait four seconds while a JVM warms up.

Three axes determine which stack survives that environment. First: calendar feature completeness — specifically CalDAV protocol support, per-user sharing with granular permissions, and outbound invite handling that external clients (iOS Calendar, Thunderbird, Android via DAVx⁵) can actually consume without custom workarounds. Second: resource overhead at idle and under concurrent sessions, because a service that idles at 80 MB but balloons to 900 MB when three users sync simultaneously is a different animal than one that's consistently 300 MB. Third: integration surface — how cleanly does this service plug into an existing Docker Compose stack with a shared Postgres instance, a shared Redis, or an SMTP relay container? Services that want their own bundled database, or that assume they own the network namespace, create operational drag that compounds over time.

Nextcloud comes up in every self-hosted calendar conversation for obvious reasons: the ecosystem is massive, the CalDAV implementation is mature, and the file access story is genuinely good. The problem is that Nextcloud's calendar features come attached to a full collaboration suite you probably don't want. On a 16 GB single-node host, a production Nextcloud install with PostgreSQL backend, Redis for file locking, and a PHP-FPM pool sized for even modest concurrency will consume 600 MB–1.2 GB at idle depending on worker configuration — before you add ONLYOFFICE or any other app. If the requirement is "calendar plus light file access," you're paying a significant resource tax for hub features (Nextcloud Talk, Activities, the full Files app with chunked uploads) that sit idle and still consume memory through background workers and cron jobs.

The sharper problem with Nextcloud is its cron dependency. The nextcloud-cron container, or a system cron entry running cron.php, has to fire every five minutes or calendar reminders drift, shares stop propagating, and background jobs queue up silently. Miss it for an hour and you start debugging why invites aren't arriving, not realizing the jobs table has 400 pending entries. That operational overhead is fine when you're running a full Nextcloud deployment for a team, but it's hard to justify for a stack whose primary workload is CalDAV sync and occasional document preview. FileRun, Radicale, and Baïkal sidestep this entirely: they're stateless or near-stateless per-request services without background job infrastructure. That's the concrete reason to look past Nextcloud when the scope is narrow.

# Rough idle footprint comparison — measure on your own host with:
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"

# Nextcloud (PHP-FPM + background workers, no apps beyond Calendar):
# nextcloud-app     ~420MiB / 16GiB
# nextcloud-cron    ~180MiB / 16GiB
# nextcloud-redis   ~12MiB  / 16GiB

# Radicale (Python, single process, filesystem storage):
# radicale          ~28MiB  / 16GiB

# Baikal (PHP, no background workers):
# baikal            ~55MiB  / 16GiB

Those aren't benchmarks — run them yourself, because PHP-FPM pool sizes and OPcache configuration will shift numbers significantly. The point is the order of magnitude difference. A service stack that leaves 14 GB free after the calendar layer is running means you can colocate FileRun, an ONLYOFFICE Document Server for previews, and a Postgres 16 instance without swapping. That arithmetic is why the hardware baseline has to be explicit before any recommendation lands.
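To turn a docker stats capture into a single number per stack, the MemUsage column can be parsed and summed. A minimal sketch, assuming the binary units docker stats prints (B, KiB, MiB, GiB); the helper names are illustrative, and the sample values in the usage note are the rough figures from the listing above, not measurements:

```python
import re
from typing import Mapping

# Conversion factors from each unit to MiB.
_TO_MIB = {"B": 1 / (1024 * 1024), "KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
_USED = re.compile(r"^\s*([\d.]+)\s*(B|KiB|MiB|GiB)\b")


def used_mib(mem_usage: str) -> float:
    """Convert the 'used' half of a MemUsage cell such as '420MiB / 16GiB' to MiB."""
    match = _USED.match(mem_usage)
    if match is None:
        raise ValueError(f"unrecognised MemUsage value: {mem_usage!r}")
    value, unit = match.groups()
    return float(value) * _TO_MIB[unit]


def total_used_mib(rows: Mapping[str, str]) -> float:
    """Sum the used memory across containers, keyed by container name."""
    return sum(used_mib(cell) for cell in rows.values())
```

Calling total_used_mib({"nextcloud-app": "420MiB / 16GiB", "nextcloud-cron": "180MiB / 16GiB", "nextcloud-redis": "12MiB / 16GiB"}) gives 612.0, against 28.0 for the single Radicale row.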

FileRun 2026: Calendar as a Bolt-On to a File Manager

The important framing to get right before you touch a config file: FileRun is a Dropbox-style self-hosted file manager that happens to expose CalDAV and CardDAV endpoints. It is not a calendar application with file storage attached. That distinction changes your expectations immediately — the calendar UI inside FileRun is bare-bones by design, and the real value of the CalDAV endpoint is for syncing with an external client (Thunderbird, Apple Calendar, DAVx⁵ on Android). If you want a rich calendar interface in the browser, you will be disappointed. If you want a single container that handles file sync and lets your phone sync its calendar without running a separate Nextcloud or Radicale instance, FileRun in 2026 is still a reasonable answer.

The MySQL 8 requirement is not negotiable and it is where most first-time deployments stall. FileRun's PHP layer also requires iconv and gd to be present — they are not always included in base PHP-FPM images and the error you get when they are missing is a generic 500, not a helpful "extension not found" message. A minimal working docker-compose.yml that avoids the common traps:

version: "3.9"

services:
  filerun-db:
    image: mysql:8.0
    container_name: filerun-db
    restart: unless-stopped
    environment:
      MYSQL_ROOT_PASSWORD: changeme_root
      MYSQL_DATABASE: filerun
      MYSQL_USER: filerun
      MYSQL_PASSWORD: changeme_filerun
    volumes:
      - filerun-db:/var/lib/mysql
    # innodb_buffer_pool_size controls RAM floor; 128M is workable for small deployments
    command: --innodb_buffer_pool_size=128M --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci

  filerun:
    image: filerun/filerun:latest
    container_name: filerun
    restart: unless-stopped
    depends_on:
      - filerun-db
    environment:
      FR_DB_HOST: filerun-db
      FR_DB_PORT: 3306
      FR_DB_NAME: filerun
      FR_DB_USER: filerun
      FR_DB_PASS: changeme_filerun
    volumes:
      - filerun-html:/var/www/html
      - /mnt/data/filerun-userfiles:/user-files
    ports:
      - "8080:80"

volumes:
  filerun-db:
  filerun-html:

The official FileRun image bundles the PHP extensions including gd and iconv, so if you pull filerun/filerun directly you will not hit the extension trap. Where people get burned is when they try to run FileRun on a generic php:8.2-fpm base behind their own Nginx config. At that point you need to install the missing extensions yourself in your Dockerfile (docker-php-ext-install gd, and confirm with php -m that iconv is listed) and rebuild. The utf8mb4 charset flags on MySQL are worth keeping as well. MySQL 8.0 already defaults to utf8mb4, but pinning it explicitly protects you if the server or an imported dump falls back to the older three-byte utf8 (utf8mb3) charset, which cannot store emoji in filenames or event titles.

The CalDAV endpoint lives at /dav.php. For Thunderbird with the TbSync extension, or Apple Calendar on macOS/iOS, the URL format is:

# Apple Calendar / Thunderbird TbSync
https://filerun.yourdomain.com/dav.php/calendars/USERNAME/default/

# What the FileRun docs show (correct)
https://filerun.yourdomain.com/dav.php/calendars/USERNAME/

# What breaks Android DAVx⁵ (missing trailing slash)
https://filerun.yourdomain.com/dav.php/calendars/USERNAME

That trailing slash on the /calendars/USERNAME/ path is not called out in the FileRun documentation, but it is the first thing to check when DAVx⁵ connects, authenticates successfully, and then shows zero calendars. The underlying issue is that the PHP router does not issue a redirect for the slash-less variant when the request comes from a DAV client rather than a browser: the server returns a 200 with an empty response body instead of a 301, so DAVx⁵ interprets it as "no calendars found" rather than a misconfigured URL. Apple Calendar is more forgiving and still finds the calendars from the slash-less URL; DAVx⁵ on Android does not.
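If you generate client configs or onboarding emails from a script, it is cheap to normalise the URL before it ever reaches a phone. A minimal sketch using only the standard library; the function name is illustrative:

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url: str) -> str:
    """Return `url` with a trailing slash on its path, leaving query and fragment intact."""
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

The function is idempotent, so running it over URLs that already end in a slash leaves them unchanged.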

On resource usage: expect the PHP-FPM container to sit at roughly 80–120 MB RAM at idle with no active users. Under active file sync or calendar operations it will spike, but it returns to baseline quickly. The MySQL container is the variable — at the 128 MB innodb_buffer_pool_size shown above, it will idle around 200–250 MB total process RSS. If you push that setting to 512 MB (reasonable for a small team), MySQL alone can consume 400–500 MB. On a VPS with 2 GB total RAM, the combination is workable but leaves limited headroom if you are also running a reverse proxy and anything else on the same host. FileRun does not cache aggressively at the PHP layer, so repeated directory listings do go back to MySQL — keep that buffer pool large enough to hold your working set of file metadata or you will feel it in latency.

OnlyOffice: Calendar Inside a Document Collaboration Suite

The most common OnlyOffice mistake isn't a configuration error — it's running the wrong product entirely. OnlyOffice Docs (the standalone document editor, the one virtually every Docker tutorial covers) has no calendar. Zero. The calendar feature lives exclusively in OnlyOffice Community Server, which is a completely different container with a completely different resource profile. If you followed a guide that had you pull onlyoffice/documentserver, you got the editor only. Community Server is onlyoffice/communityserver, and conflating the two will cost you hours before you realize the calendar tab simply doesn't exist in what you deployed.

The Docker deployment reality for Community Server is aggressive on resources. The single container image runs MySQL, Elasticsearch, and RabbitMQ internally — not as separate compose services you can tune individually, but bundled inside the one container with its own internal process supervisor. On a 16 GB host, expect 3–5 GB RAM consumed at idle before a single user connects. Elasticsearch alone accounts for a significant chunk of that. You can cap it somewhat with JVM heap flags passed as environment variables, but you're fighting the architecture rather than tuning it:

docker run -d \
  --name onlyoffice-community \
  -p 80:80 -p 443:443 -p 5222:5222 \
  -e ELASTICSEARCH_SERVER_JAVA_OPTS="-Xms512m -Xmx512m" \
  -v /app/onlyoffice/CommunityServer/data:/var/www/onlyoffice/Data \
  -v /app/onlyoffice/CommunityServer/mysql:/var/lib/mysql \
  -v /app/onlyoffice/CommunityServer/logs:/var/log/onlyoffice \
  onlyoffice/communityserver

Dropping Elasticsearch's heap to 512 MB helps, but the service will start complaining under any real indexing load. This is not a box you run on a $6/month VPS.

The calendar feature surface is genuinely capable once you're in. Room booking is built in, external CalDAV sync exists, and iCal feed export works reliably. The UI is polished — arguably the most finished calendar UI in the self-hosted space. The friction shows up in CalDAV client compatibility. DAVx⁵ on Android connects without drama. macOS Calendar is where things get fiddly: the auto-discovery URL that Apple expects doesn't always resolve correctly, and you'll often end up manually constructing the endpoint URL in the format https://your-domain/caldav/[user-guid]/ rather than just pointing at the domain root. The docs claim broad CalDAV compatibility; the reality is you'll debug at least one client before everything syncs cleanly.

The overhead is defensible exactly once: when your team already needs collaborative document editing and would otherwise run a separate OnlyOffice Docs instance anyway. In that scenario, Community Server consolidates two services into one, and the 4 GB idle RAM cost gets spread across both use cases. If you only want a web calendar — maybe with some file storage on the side — that calculus falls apart immediately. FileRun with a CalDAV sidecar, or Nextcloud on a constrained machine, will deliver a usable calendar at a fraction of the memory footprint. Community Server's bundled architecture is a product decision optimized for replacing Google Workspace entirely, not for running one module of it.

Pangolin: The Reverse-Proxy-Native Approach

Pangolin solves a different problem than FileRun or OnlyOffice do. It doesn't ship a calendar, a file manager, or a document editor — it's a self-hosted tunneling and reverse proxy platform, roughly analogous to Cloudflare Tunnel but running entirely on infrastructure you control. The interesting move here is using Pangolin as the exposure layer for a lightweight CalDAV server like Baïkal or Radicale, so you get identity-aware access control and encrypted tunneling without touching a firewall rule. If your calendar host lives on a separate VLAN, a homelab node behind CGNAT, or a remote machine you'd rather not punch holes in, this architecture is worth understanding.

The core architectural difference: FileRun and OnlyOffice assume the service is directly reachable — you configure a reverse proxy in front, you open ports, you manage TLS termination yourself. Pangolin flips this. The tunnel daemon on your internal host dials outward to your Pangolin server; no inbound port ever opens. The Pangolin edge then applies zero-trust rules (user identity, device posture, resource policies) before a request even reaches your Baïkal container. For a calendar endpoint specifically, this means a stolen session token is less dangerous — the attacker still has to pass the identity check at the Pangolin layer before they can even issue a CalDAV request.

Configuring a Pangolin site target to front a Baïkal container requires careful attention to header forwarding. CalDAV clients — iOS Calendar, Thunderbird with the TbSync extension, and most DAV-aware clients — rely on the Authorization header passing through untouched, and they use the Host and DAV headers to discover capability. A common failure mode is the proxy stripping or rewriting Authorization: Digest headers, which causes silent auth failures that look like connectivity problems. The Pangolin site config to get this right:

# pangolin/config/sites/baikal-cal.yaml
site:
  id: baikal-cal
  target: http://baikal.internal:8800   # internal container, no port exposure needed
  tunnel:
    enabled: true
    daemon_host: homelab-node-01        # the host running the newt tunnel daemon

  proxy:
    preserve_host: true                 # CalDAV clients break if Host gets rewritten
    trusted_headers:
      - X-Forwarded-For
      - X-Forwarded-Proto
    pass_through_headers:
      - Authorization                   # Digest auth must not be consumed or stripped
      - DAV                             # capability negotiation header
      - Depth                           # required for PROPFIND recursion
      - Destination                     # required for MOVE/COPY operations
    strip_headers:
      - X-Internal-Token                # don't leak internal routing headers

  access:
    policy: authenticated               # Pangolin identity check before any proxying
    allowed_roles:
      - calendar-users

The Depth and Destination headers aren't documented in most proxy guides because they're invisible in browser-based apps — but CalDAV PROPFIND requests use Depth: 1 extensively, and any proxy that doesn't pass them through will produce 400 errors that look like Baïkal configuration failures. Test with curl before trusting a GUI client's error message:

# Verify PROPFIND reaches Baïkal with correct headers intact
curl -v \
  --request PROPFIND \
  --header "Depth: 1" \
  --header "Content-Type: application/xml" \
  --user "user:password" \
  --data '' \
  https://baikal-cal.your-pangolin-domain.example/dav.php/calendars/user/

# A working response starts with HTTP/2 207 (Multi-Status)
# A proxy header problem usually returns 401 or 400 with no DAV response body
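The same probe can be scripted when you want to run it from a monitoring job rather than a shell. This is a sketch under assumptions: it sends Basic credentials (a backend configured for Digest would need urllib's HTTPDigestAuthHandler instead), and both function names are hypothetical:

```python
import base64
import urllib.request


def build_propfind(url: str, user: str, password: str, depth: str = "1") -> urllib.request.Request:
    """Build a PROPFIND request with the headers a CalDAV client would send."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=b"",  # empty body, as in the curl example
        method="PROPFIND",
        headers={
            "Depth": depth,
            "Content-Type": "application/xml",
            "Authorization": f"Basic {token}",
        },
    )


def interpret_status(status: int) -> str:
    """Map an HTTP status to the diagnosis described in the comments above."""
    if status == 207:
        return "ok: Multi-Status response, DAV headers reached the backend"
    if status in (400, 401):
        return "suspect the proxy: check Authorization and Depth pass-through"
    return f"unexpected status {status}: inspect the response body"
```

Passing the request to urllib.request.urlopen sends it; a 207 comes back as a normal response, while 400 and 401 raise urllib.error.HTTPError, whose code attribute can be fed to interpret_status.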

The honest trade-off: you're now operating two separate systems where FileRun runs one. The Pangolin tunnel daemon (newt) on each internal host needs to stay running and connected, and you need to reason about two failure domains — the calendar service going down versus the tunnel or Pangolin edge going down. On my setup I'd handle this by running the newt daemon under a systemd unit with Restart=always and a separate health check, rather than assuming the Pangolin dashboard will catch a dropped tunnel fast enough. The payoff is real network isolation: the Baïkal container genuinely has no listening port reachable from outside its own host, and adding a second calendar service for a different team means adding a site config entry, not modifying firewall rules or figuring out NAT hairpinning again.

Side-by-Side: What You Actually Get Per Use Case

The comparison that matters isn't feature lists — it's what each stack costs you when it's idle and what breaks first under real conditions. Most self-hosters discover these limits the hard way after a weekend of setup, not from the docs. Let me short-circuit that.

| Criteria | FileRun | OnlyOffice | Pangolin |
| --- | --- | --- | --- |
| CalDAV standard compliance | Partial: CalDAV exposed via bundled component, not a first-class implementation; interop quirks with Thunderbird | Reasonable RFC 4791 coverage but config is non-obvious; requires explicit HTTPS or clients silently refuse to connect | N/A: Pangolin is a reverse proxy/tunnel layer, CalDAV compliance is entirely the backend's problem |
| Idle RAM floor | ~200–350 MB with PHP-FPM + MariaDB; predictable and low | 1.2–2 GB at genuine idle with Document Server running; JVM-adjacent behavior, it never really sleeps | ~50–80 MB; it's a Go binary doing tunnel brokering, not processing documents |
| Disk footprint after initial pull | ~800 MB–1.2 GB including MariaDB image | 6–8 GB for the full Document Server stack; the community edition image alone is over 4 GB compressed | Under 200 MB; single binary distribution path is available |
| Multi-user sharing UI | Solid file-level sharing with link expiry, password protection; calendar sharing is minimal compared to the file side | Room/group calendars work well when the full platform is deployed; sharing model is tightly coupled to the broader workspace concept | No sharing UI; delegates entirely to whatever is sitting behind it |
| Mobile client compatibility | Works with DAVx⁵ on Android and iOS native Accounts; occasional sync delay on the file side | Mobile web is passable; native mobile CalDAV via DAVx⁵ functions but the UI push is clearly toward the desktop browser experience | Transparent to clients: they hit whatever URL Pangolin exposes and never know there's a tunnel |
| Biggest single dealbreaker | Commercial license required for production; the free tier isn't usable long-term for a real deployment | Resource floor makes it indefensible on a 16 GB host running other services; you're burning RAM 24/7 for a feature most users open twice a week | People mistake it for a calendar product and are confused when there's no calendar; it's access infrastructure, full stop |

FileRun verdict: The right call if you're already running it as a file manager and want calendar as a secondary convenience, not a primary capability. On constrained hardware — say a 2-core VPS with 4 GB RAM — FileRun's PHP-FPM footprint is manageable where OnlyOffice would be immediately out of the question. The CalDAV implementation is good enough for personal use and small teams, provided you're not expecting tight spec compliance. License cost is the real gate: budget for it or pick something else.

OnlyOffice verdict: Defensible only when real-time document collaboration is also on the requirements list. If someone on the team needs co-editing on DOCX or XLSX files in the browser, the resource spend amortizes across multiple use cases and starts to make sense. Running OnlyOffice purely for CalDAV on a host that also runs a database, a Node app, and a reverse proxy is resource waste you'll feel every time you SSH in and check htop. On a dedicated 32 GB machine with room to spare it's fine, but that's exactly the scenario where you probably have better options too.

Pangolin verdict: Not a calendar product and shouldn't be evaluated as one. Its specific value is solving the external access problem — getting a CalDAV server that lives on a machine without a public IP reliably reachable from the internet without punching firewall holes. Pair it with Baïkal (Docker image under 50 MB, strict RFC 4791 implementation, near-zero idle RAM beyond PHP-FPM) and you get a minimal-footprint calendar stack that's actually reachable from iOS and Android over WireGuard or the Pangolin tunnel. That combination beats any of the heavier stacks for operators who just want calendars to sync and don't need a document suite hanging off the side.

# Baïkal + Pangolin: the lightest viable external CalDAV setup
# docker-compose.yml fragment

services:
  baikal:
    image: ckulka/baikal:nginx  # ~48 MB compressed; nginx variant avoids Apache overhead
    volumes:
      - ./baikal/config:/var/www/baikal/config
      - ./baikal/Specific:/var/www/baikal/Specific
    restart: unless-stopped
    # Do NOT expose a port directly — let Pangolin handle TLS termination

  pangolin:
    image: fosrl/pangolin:latest
    environment:
      - PANGOLIN_UPSTREAM=http://baikal:80
      - PANGOLIN_DOMAIN=cal.yourdomain.com
      # Pangolin handles ACME cert renewal; clients see valid TLS, baikal sees plain HTTP internally
    ports:
      - "443:443"
      - "80:80"
    depends_on:
      - baikal
    restart: unless-stopped

The one gotcha with this Baïkal pairing: Baïkal's admin interface uses the same nginx instance as the CalDAV endpoint. Restrict /admin at the Pangolin or upstream config level — don't leave it accessible to the public-facing URL. Pangolin's route config supports path-level rules, so this is a one-liner addition rather than a separate nginx config overlay.

Gotchas That Cost Time Across All Three

The CalDAV principal URL problem will burn you on every new client setup, no matter which backend you're running. Most calendar apps (Thunderbird, iOS Calendar, DAVx⁵) attempt auto-discovery by hitting /.well-known/caldav and following redirects to find the principal URL, then the calendar collection. The spec describes how this should work. Reality is messier: some clients interpret a redirect to the principal as the collection itself, others stop after one hop, and a few just silently fail without logging what they actually tried. The configuration that works most reliably across clients is skipping auto-discovery entirely and hardcoding the full collection path. For FileRun that's something like https://files.example.com/dav.php/calendars/username/default/. For Baïkal behind Pangolin, https://cal.example.com/dav.php/principals/username/ is the principal, but clients want the calendar collection, which lives under /dav.php/calendars/username/ rather than under the principal path. Test each client explicitly; don't assume that because auto-discovery worked in one, it'll work in another.

TLS termination at a reverse proxy breaks CalDAV in a way that's genuinely confusing to debug because the app logs show successful requests while every client silently refuses to authenticate. What's actually happening: Caddy or Nginx is terminating HTTPS and forwarding plain HTTP to the backend container. The CalDAV server generates redirect URLs and auth challenges using HTTP, the client sees those and either refuses to send credentials over what it thinks is a plaintext connection, or the authentication handshake fails because the scheme mismatch breaks the URL comparison logic. The fix is the same whether you're using Caddy or Nginx — you need to explicitly forward the original protocol. In Nginx:

proxy_set_header X-Forwarded-Proto https;
proxy_set_header X-Forwarded-Host $host;
# Without these two lines, PROPFIND responses contain http:// hrefs
# and clients will either fall back silently or fail auth

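In Caddy v2, reverse_proxy sets X-Forwarded-Proto and X-Forwarded-Host from the incoming request by default, so proxying to a container by Docker network name over plain HTTP (which is normal) usually needs no extra directive. If hrefs still come back as http://, check whether another proxy or load balancer sits in front of Caddy and whether the backend is configured to trust forwarded headers at all; as a last resort you can pin the value with header_up X-Forwarded-Proto https. Miss this and you'll spend an hour staring at 401s that aren't actually authentication failures.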
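Why the forwarded headers matter is easier to see from the backend's side. A minimal sketch, assuming a backend that builds absolute hrefs from request headers; external_base_url is a hypothetical helper, not code from Baïkal or FileRun:

```python
from typing import Mapping


def external_base_url(headers: Mapping[str, str], backend_scheme: str = "http") -> str:
    """Rebuild the client-facing base URL the way a proxy-aware backend might."""
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Without X-Forwarded-Proto the backend only knows its own (plain HTTP) scheme.
    scheme = lowered.get("x-forwarded-proto", backend_scheme)
    host = lowered.get("x-forwarded-host") or lowered.get("host", "localhost")
    return f"{scheme}://{host}"
```

With only a Host header from the internal hop, the result is an http:// URL pointing at the container name, which is the kind of href that ends up in PROPFIND responses. Once the proxy forwards the original scheme and host, the same code returns the public https:// origin.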

FileRun has a specific gotcha that isn't prominently documented: the APP_ID environment variable gets written into the database on first boot, and that database value is what the application actually uses for URL generation and CSRF validation. If you change the domain the instance is served from — even just switching from an IP to a hostname, or adding a subdomain — updating the env var alone does nothing. The running instance reads the stored value. You need a direct database update:

-- Run against your FileRun MySQL/MariaDB database
UPDATE fc_settings SET value = 'https://newdomain.example.com'
WHERE name = 'APP_ID';

-- Then restart the FileRun container; the env var at that point
-- should match what's in the DB to avoid future confusion

Failing to do this produces subtly broken behavior — file sharing links generate with the old domain, CalDAV redirects point to the wrong origin, and WebDAV clients reject the server's responses. The env var and the DB value need to be in sync, with the DB value being authoritative.

OnlyOffice Community Server bundles Elasticsearch as an internal dependency for search and, depending on your installation path, for calendar indexing. ES requires a kernel-level setting: vm.max_map_count must be at least 262144, and on most default VPS images it's set to 65530. When the bundled ES process hits that limit it fails to start, but Community Server doesn't surface this as a clear error. The calendar UI just stops working or loads empty. Check before anything else:

# On the host — not inside the container
sysctl vm.max_map_count
# If output is below 262144:
sysctl -w vm.max_map_count=262144
# To persist across reboots:
echo "vm.max_map_count=262144" >> /etc/sysctl.conf

On a shared VPS where you don't have host-level access, this is a hard blocker — you cannot fix it from inside a container, and privileged mode doesn't help with kernel parameters that require host access. If you're on a managed host without shell access to the hypervisor node, OnlyOffice Community Server is not a viable option. That's not a configuration problem you can engineer around.
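If you provision hosts with a script, the same check fits in a pre-flight step that runs before the container is pulled. A minimal sketch that reads the standard Linux procfs entry; the function names are illustrative:

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # minimum cited above for Elasticsearch


def map_count_sufficient(raw: str, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True if the raw sysctl value meets the required minimum."""
    return int(raw.strip()) >= required


def host_map_count_sufficient(proc_file: str = "/proc/sys/vm/max_map_count") -> bool:
    """Read the current host value from procfs and compare it to the minimum."""
    return map_count_sufficient(Path(proc_file).read_text())
```

Run host_map_count_sufficient() on the host itself; a False result on a VPS where you cannot change the value is the signal to stop before investing in the rest of the deployment.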

Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.

Chamberlain MyQ and HomeKit: Self-Hosted Bridges That Actually Work

우병수 — Fri, 19 Jun 2026 07:46:57 +0000

TL;DR: Chamberlain's 2023 API lockdown was deliberate and aggressive. They didn't deprecate an old endpoint or bump a version; they actively blocked third-party apps from authenticating with the MyQ cloud API, killing integrations for Google Home, SmartThings, and every HomeKit bridge that routed through their servers.

📖 Reading time: ~19 min

What's in this article

The Problem: MyQ Locked You Out of Your Own Garage
Option 1: Homebridge with the homebridge-myq Plugin (Cloud-Dependent, Increasingly Brittle)
Option 2: ratgdo — Local Control Hardware That Removes MyQ From the Equation
The Docker Stack That Ties It Together
Comparing the Two Approaches on Real Constraints
Gotchas That Will Cost You Time
Where This Fits in a Broader Self-Hosted Automation Stack

The Problem: MyQ Locked You Out of Your Own Garage

Chamberlain's 2023 API lockdown was deliberate and aggressive. They didn't deprecate an old endpoint or bump a version — they actively blocked third-party apps from authenticating with the MyQ cloud API, killing integrations for Google Home, SmartThings, and every HomeKit bridge that routed through their servers. The stated reason was "unauthorized" access. The actual effect was forcing users toward Chamberlain's own paid ecosystem. That decision isn't being walked back.

The native HomeKit path Chamberlain offers costs real money. MyQ doesn't support HomeKit natively — you need their separate MyQ Home Bridge hardware, which runs around $100, stacks on top of a garage opener you already paid for, and still depends on Chamberlain's cloud infrastructure to function. So you're buying hardware to access a cloud service that can break whenever Chamberlain has downtime or decides to change the rules again. That's not a smart trade-off for anyone running a home automation stack with reliability requirements.

If you already have Docker running on a home server, the compute cost of a local bridge is negligible — a container using under 100MB of RAM, no GPU required. More importantly, a properly configured local bridge survives three failure modes that the cloud path can't handle: your internet going down, Chamberlain's servers having an outage, and future API policy changes. That last one is the critical argument for self-hosting here. You're not just solving today's problem — you're insulating your garage automation from a company that has already demonstrated willingness to break integrations for commercial reasons.

Two fundamentally different approaches exist, and conflating them leads to bad decisions. The first category is software-only bridges: tools like homebridge-myq or older forks that still attempt to reverse-engineer or work around the MyQ cloud API. These are fragile post-lockdown. They work until the next authentication change, and Chamberlain has shown it will keep patching those gaps. The second category is hardware-based local control: ratgdo, for example, replaces the MyQ cloud dependency with a local device wired directly to your opener's Security+ bus. No cloud call, no API key, no Chamberlain servers in the loop. This article covers both approaches honestly, including where the software path still makes sense (it does, in specific narrow situations) and where hardware-local control is the only durable answer.

Option 1: Homebridge with the homebridge-myq Plugin (Cloud-Dependent, Increasingly Brittle)

The surprising thing about homebridge-myq is that it still functions at all. Chamberlain has actively blocked third-party API access multiple times — explicitly telling integrators their platform is off-limits — yet the plugin keeps getting patched because the MyQ mobile app still has to talk to some endpoint, and determined plugin maintainers keep reverse-engineering it. That's the entire situation summarized: you're betting on a cat-and-mouse game staying in your favor.

Homebridge itself is solid. It's a Node.js process running a HAP server that iOS treats as a real HomeKit bridge. Spin it up with Docker Compose and your phone finds it within a minute or two. The web UI on port 8581 handles plugin installs, config editing, and log tailing — you don't need to touch the container shell for day-to-day operation. Here's a minimal compose file that actually works:

services:
  homebridge:
    image: homebridge/homebridge:latest
    restart: unless-stopped
    network_mode: host          # HAP discovery requires mDNS — bridge networking breaks pairing
    volumes:
      - ./homebridge:/homebridge # persistent config, plugins, and credentials
    environment:
      - PGID=1000
      - PUID=1000
      - HOMEBRIDGE_CONFIG_UI_PORT=8581

network_mode: host is non-negotiable. HAP uses mDNS for HomeKit discovery, and if you run this behind Docker's NAT bridge, your iPhone will never find the bridge, or it'll find it once and lose it on every container restart. Once the container is up, install homebridge-myq from the plugin UI and add the block below to the platforms array in your config.json. The file must be strict JSON, so the annotation lives here rather than in the file: lockoutMinutes makes the plugin back off automatically if Chamberlain starts rejecting requests.

{
  "platform": "myQ",
  "email": "you@example.com",
  "password": "yourpassword",
  "polling": {
    "garage": 30
  },
  "options": {
    "lockoutMinutes": 5
  }
}

Don't set polling below 30 seconds. The MyQ backend rate-limits aggressively, and once your account gets flagged you'll get auth failures that look identical to a broken plugin — hours of debugging a problem that's actually just a cooldown. Pin the plugin version in your package.json inside the Homebridge volume after you find a working release. The plugin has gone through version 9, 10, and now v11 branches with breaking changes on each; auto-updating through the UI has bitten a lot of people. Check the GitHub issues for homebridge-myq before any Chamberlain app update rolls out, because those updates frequently rotate API tokens or change endpoint paths within days of release.
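A small lint step can catch an over-eager polling value before it gets an account flagged. This is a sketch under stated assumptions: it expects config.json to be strict JSON with the standard Homebridge platforms array, it reads the polling.garage key exactly as this article's example spells it, and the function name is made up for illustration.

```python
import json

MIN_POLL_SECONDS = 30  # floor suggested in the text above


def polling_problems(config_text: str, minimum: int = MIN_POLL_SECONDS) -> list[str]:
    """Return warnings for myQ platform blocks that poll faster than the floor."""
    config = json.loads(config_text)
    problems = []
    for platform in config.get("platforms", []):
        if platform.get("platform") != "myQ":
            continue  # only the myQ block is of interest here
        interval = platform.get("polling", {}).get("garage")
        if interval is None:
            problems.append("myQ block has no polling.garage value")
        elif interval < minimum:
            problems.append(f"polling.garage={interval}s is below the {minimum}s floor")
    return problems


if __name__ == "__main__":
    sample = '{"platforms": [{"platform": "myQ", "polling": {"garage": 10}}]}'
    for problem in polling_problems(sample):
        print(problem)
```

Pointing it at the real file is a matter of reading ./homebridge/config.json from the mounted volume and passing the text in.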

The honest trade-off here: this costs nothing extra, takes under ten minutes, and the HomeKit UX is genuinely clean once it's working. But the failure mode is brutal — Chamberlain changes an endpoint, the plugin stops polling, and your garage door disappears from Home app with zero notification. No alert, no automation failure email, just silence. If you have a physical keypad or a secondary entry point, that's annoying. If the garage is your front door and you run automations like "unlock on arrival," a silent failure at 11pm is a real problem. For that use case, keep reading and look at the hardware-based options instead.

Option 2: ratgdo — Local Control Hardware That Removes MyQ From the Equation

The most interesting thing about ratgdo (Rage Against The Garage Door Opener) is what it doesn't do: it doesn't talk to Chamberlain's cloud, it doesn't poll an API, and it doesn't break when Chamberlain decides to lock down their ecosystem again. Instead, it wires directly to the Security+ 2.0 serial bus that your opener already exposes on its terminal strip and speaks the opener's native protocol. That means the board gets real door state — not an approximated tilt sensor reading — and can issue open/close commands at the same level as a hardwired wall button. The firmware exposes everything over MQTT or ESPHome. No cloud path exists in this architecture because none is needed.

The hardware side is genuinely simple. You connect three wires from the ratgdo board to the opener's terminal strip: ground, the wall-control terminal that carries the Security+ 2.0 serial signal, and the obstruction-sensor terminal. That's the entire physical installation on a compatible Chamberlain or LiftMaster opener. Flashing is done through the browser-based installer at install.ratgdo.info: you plug the board into USB, hit install, and the site handles the WebSerial flash without touching the Arduino IDE. Total BOM cost lands between $20 and $35 depending on whether you source a pre-assembled board or build your own around an ESP8266/ESP32 module. Compare that to the ongoing risk of whatever Chamberlain decides to do next quarter.

The HomeKit integration path has a few hops but each one is solid. Flash the ESPHome firmware variant (not the MQTT one — ESPHome gives you cleaner integration downstream), then add the device to your ESPHome instance running either in Docker or as a Home Assistant add-on. From there you have two bridging options:

Home Assistant native HomeKit integration: HA's built-in HomeKit bridge exposes entities directly to the Home app. If you're already running HA, this is zero extra software.
Homebridge with homebridge-homeassistant: routes HA entities into Homebridge, which then bridges to HomeKit. More moving parts, but useful if Homebridge is already your HomeKit hub for other devices.

Either way, the ratgdo door entity shows up as a garage door accessory in HomeKit with open, closed, and opening/closing states — not a binary switch workaround.

The failure modes here are fundamentally different from the cloud-dependent options, and that matters operationally. Your RF remotes continue working in parallel: ratgdo adds a control channel, it doesn't replace the existing one. If the ratgdo board loses power or the firmware crashes, your remotes and wall button are completely unaffected. The realistic failure scenarios are a wiring mistake during install (landing a wire on the wrong terminal is the classic one; move it and power-cycle the board, no reflash needed) or a firmware flash that didn't complete cleanly (re-run the web installer). Neither of those is a silent failure at 2 AM because Chamberlain rotated an API key. The board also survives any future Chamberlain cloud changes because it has never spoken to that cloud in the first place.

The Docker Stack That Ties It Together

The surprising part of this whole stack is how little you actually need. Two services, one Compose file, and the ratgdo firmware does most of the heavy lifting on the MQTT side. This stack assumes the board is running ratgdo's MQTT firmware rather than the ESPHome variant, because nothing here speaks the ESPHome native API. You're not running Home Assistant here: that adds a roughly 500MB image pull plus its own persistence layer, and it's unnecessary if your only goal is HomeKit control of a garage door.

version: "3.9"

services:
  mosquitto:
    image: eclipse-mosquitto:2.0
    container_name: mosquitto
    restart: unless-stopped
    ports:
      - "1883:1883"
    volumes:
      - mosquitto_data:/mosquitto/data
      - mosquitto_logs:/mosquitto/log
      # config must exist before first start or mosquitto refuses to launch
      - ./mosquitto/mosquitto.conf:/mosquitto/config/mosquitto.conf:ro
    healthcheck:
      test: ["CMD", "mosquitto_sub", "-t", "$$SYS/#", "-C", "1", "-i", "healthcheck", "-W", "3"]
      interval: 30s
      timeout: 5s
      retries: 3

  homebridge:
    image: homebridge/homebridge:latest
    container_name: homebridge
    restart: unless-stopped
    network_mode: host
    # host networking is not optional — mDNS for HomeKit pairing breaks in bridge mode
    # unless you run an mDNS reflector alongside it
    volumes:
      - homebridge_config:/homebridge
    environment:
      - PGID=1000
      - PUID=1000
      - HOMEBRIDGE_CONFIG_UI_PORT=8581
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8581"]
      interval: 60s
      timeout: 10s
      retries: 3
      start_period: 90s
    depends_on:
      mosquitto:
        condition: service_healthy

volumes:
  mosquitto_data:
  mosquitto_logs:
  homebridge_config:

The network_mode: host line on Homebridge is the thing that catches people. HomeKit device discovery uses mDNS (Bonjour), which doesn't cross Docker's bridge network boundary cleanly. If Homebridge is in a container network, your iPhone will pair once and then show "No Response" after a restart because the mDNS announcement never reaches your local subnet. host mode fixes this on a flat network. If ratgdo and Homebridge are on separate VLANs — say you've put IoT devices on a segregated subnet — you need Avahi running on the host with enable-reflector=yes in /etc/avahi/avahi-daemon.conf, plus a DNS override or static IP so ratgdo can resolve your broker hostname across the VLAN boundary. Pi-hole users: add a local DNS record for mqtt.home.arpa or whatever you're using, because ratgdo's MQTT firmware won't retry indefinitely if the first broker connection attempt fails at boot.
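Because a failed first connection is easy to miss, it can help to confirm from another machine on the IoT subnet that the broker name resolves and the port answers. The following Python sketch does only that; the hostname in the usage line is the example name from the paragraph above, and a successful TCP connect says nothing about MQTT authentication.

```python
import socket


def broker_reachable(host: str, port: int = 1883, timeout: float = 3.0) -> bool:
    """Return True if the name resolves and a TCP connection to the broker opens."""
    try:
        # create_connection resolves the hostname and connects in one step;
        # DNS failures and refused connections both surface as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # mqtt.home.arpa is the example record name used in the text; use your own.
    print(broker_reachable("mqtt.home.arpa"))
```

If this prints False from the IoT VLAN but True from the server's own subnet, the problem is routing or DNS across the VLAN boundary rather than Mosquitto itself.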

The homebridge-mqttthing plugin config is where the mapping happens. Install it through the Homebridge UI or drop this directly into your config.json accessories array:

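{
  "accessory": "mqttthing",
  "type": "garageDoorOpener",
  "name": "Garage Door",
  "url": "mqtt://localhost:1883",
  "topics": {
    "getCurrentDoorState": "ratgdo/garage/status/door",
    "setTargetDoorState":  "ratgdo/garage/command/door"
  },
  "doorCurrentValues": ["open", "closed", "opening", "closing", "stopped"],
  "doorTargetValues": ["open", "close"],
  "optimistic": false
}

The url points at localhost rather than the Compose service name because Homebridge runs with network_mode: host, where Docker's service-name DNS is not available; the broker is reached through the published port 1883. The topic paths assume a device named garage under the ratgdo prefix, so substitute your own prefix and device name. ratgdo's MQTT firmware publishes lowercase string payloads on its door status topic and accepts open and close on the command topic, so no transform function or value-template gymnastics are needed, but confirm the exact strings with mosquitto_sub before relying on them. The key names follow the homebridge-mqttthing README for this accessory type; check them against the plugin version you install. The optimistic: false flag matters: with it set to true, Homebridge assumes the door reached its target state immediately and won't update the tile until the next MQTT message. With it false, the Home app holds the "opening" animation until ratgdo publishes open, which reflects the state the opener actually reports. That's the behavior you want.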
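To make the mapping concrete, here is roughly what a bridge does with each payload: it translates the string into the numeric value HomeKit uses for the Current Door State characteristic. This Python sketch is illustrative and is not the plugin's code; the lookup ignores case and whitespace so it does not depend on which casing the firmware uses.

```python
from enum import IntEnum
from typing import Optional


class CurrentDoorState(IntEnum):
    """Numeric values HomeKit uses for the Current Door State characteristic."""
    OPEN = 0
    CLOSED = 1
    OPENING = 2
    CLOSING = 3
    STOPPED = 4


def to_homekit_state(payload: str) -> Optional[CurrentDoorState]:
    """Map an MQTT payload to a HomeKit state, ignoring case and whitespace."""
    try:
        return CurrentDoorState[payload.strip().upper()]
    except KeyError:
        return None  # unknown payload: leave the tile unchanged rather than guess


if __name__ == "__main__":
    for payload in ("open", "CLOSING", "ajar"):
        print(payload, "->", to_homekit_state(payload))
```

Returning None for an unrecognized payload is a deliberate choice in this sketch: a tile that keeps its last known state is less misleading than one that silently flips to a default.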

For the Mosquitto config, the default out-of-box config rejects all anonymous connections since version 2.0 — you'll hit a silent failure where ratgdo connects and immediately disconnects with no useful log output from the container side. A minimal mosquitto.conf that actually works:

listener 1883
allow_anonymous true
persistence true
persistence_location /mosquitto/data/
log_dest file /mosquitto/log/mosquitto.log
log_type error
log_type warning
log_type information

If you want auth, add a password_file line and pre-generate credentials with mosquitto_passwd — but do that after you've confirmed the unauthenticated flow works end-to-end. Debugging MQTT auth failures through two firmware layers simultaneously is not a good use of an afternoon. Once the stack is stable, the healthcheck on Homebridge (curl -f http://localhost:8581) will catch process crashes that otherwise show up silently as "No Response" in the Home app hours later with no obvious cause.
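A quick sanity check of mosquitto.conf before the first start can save a restart loop. The sketch below is a deliberately naive parser written for this article, not a Mosquitto tool: it only understands simple "directive value" lines and checks the two conditions discussed above.

```python
def parse_mosquitto_conf(text: str) -> dict[str, list[str]]:
    """Collect directive -> list of values, skipping blank lines and # comments."""
    directives: dict[str, list[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        directives.setdefault(key, []).append(value.strip())
    return directives


def config_warnings(text: str) -> list[str]:
    """Flag the two misconfigurations that stop LAN clients from connecting."""
    found = parse_mosquitto_conf(text)
    warnings = []
    if "listener" not in found:
        warnings.append("no listener directive: only local connections will be accepted")
    anonymous_on = found.get("allow_anonymous", ["false"])[-1] == "true"
    if not anonymous_on and "password_file" not in found:
        warnings.append("anonymous access is off and no password_file is set")
    return warnings


if __name__ == "__main__":
    print(config_warnings("listener 1883\nallow_anonymous true\n"))
```

An empty list means neither problem was detected; it does not prove the rest of the file is valid.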

Comparing the Two Approaches on Real Constraints

The most important thing ratgdo changes isn't features — it's the dependency graph. Once the ESP32 is flashed and talking MQTT to your local broker, the entire open/close/status loop runs inside your LAN. Chamberlain's servers can go down, your ISP can have an outage, the plugin author can abandon the project — none of that touches your garage door. homebridge-myq sits at the opposite extreme: it depends on Chamberlain's API being up, your internet connection being stable, the Homebridge plugin tracking whatever undocumented API changes Chamberlain pushes, and the MyQ app not triggering a lockout when it detects unusual auth patterns. Any one of those links breaks the chain.

The silent breakage problem with homebridge-myq deserves specific attention because it's worse than an obvious failure. The most common failure mode isn't an error — it's stale state. The Home app shows the door as closed when it's open, or vice versa, because a polling call silently failed and the plugin didn't invalidate its cached state. You don't find out until you ask Siri to close a door that's already closed, or worse, until you assume it's closed and leave. With ratgdo over MQTT, the ESP32 publishes state changes as they happen on the serial bus — there's no polling, no cache, and no cloud intermediary to go quiet without telling you.
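One way to defuse stale state on any polled source is to store a timestamp next to the cached value and refuse to trust it past a maximum age. The class below is a minimal Python sketch of that idea; the name StateWatch and the age threshold are illustrative, and nothing here is part of either plugin.

```python
import time
from typing import Optional


class StateWatch:
    """Cache a door state together with the time it was last confirmed."""

    def __init__(self, max_age_seconds: float):
        self.max_age_seconds = max_age_seconds
        self.state: Optional[str] = None
        self.updated_at: Optional[float] = None

    def update(self, state: str, now: Optional[float] = None) -> None:
        """Record a fresh reading; `now` can be injected for testing."""
        self.state = state
        self.updated_at = time.monotonic() if now is None else now

    def is_stale(self, now: Optional[float] = None) -> bool:
        """True when no reading exists or the last one is older than the limit."""
        if self.updated_at is None:
            return True
        current = time.monotonic() if now is None else now
        return (current - self.updated_at) > self.max_age_seconds


if __name__ == "__main__":
    watch = StateWatch(max_age_seconds=90)
    watch.update("closed")
    print(watch.state, "stale" if watch.is_stale() else "fresh")
```

An automation that checks is_stale() before acting on "closed" fails loudly instead of assuming, which is the property the polled path lacks by default.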

Setup cost is real but asymmetric in an important way. homebridge-myq is a ten-to-fifteen-minute job if you're already running Homebridge in Docker: install the plugin from the Homebridge UI, drop credentials into the config, restart. That's it. ratgdo is a different skill set entirely: you're physically accessing the opener's terminal block, connecting three wires (no soldering required on the ratgdo v2.5 board), and flashing ESPHome firmware to the board. None of those steps are hard, but "comfortable running physical wires near a garage ceiling" is a real prerequisite that homebridge-myq doesn't have. The ESPHome side is straightforward if you've touched it before:

# ratgdo ESPHome minimal config excerpt
substitutions:
  device_name: ratgdo
  friendly_name: Garage Door

external_components:
  - source: github://ratgdo/esphome-ratgdo@main
    refresh: 0s

ratgdo:
  id: myratgdo
  input_obst_pin: GPIO21   # pin numbers are board-specific; verify against your board revision's pinout
  output_gdo_pin: GPIO19
  input_gdo_pin: GPIO20

lock:
  - platform: ratgdo
    name: "${friendly_name} Lock"
    ratgdo_id: myratgdo

cover:
  - platform: ratgdo
    name: "${friendly_name}"
    ratgdo_id: myratgdo

Latency is the axis that surprises people most when they switch. ratgdo's serial bus command reaches the opener in under a second — usually you hear the motor before Siri finishes confirming. homebridge-myq adds a full cloud round-trip: your command goes to Chamberlain's servers, gets processed, comes back as a state change, and Homebridge polls for it. On a good day that's one to four seconds. Under API load or with a slow polling interval, it can spike to ten or more, and occasionally the command just doesn't register. For an automation that's supposed to close the door when you leave home, a four-second delay is annoying; a silent failure is a security problem.

The decision tree is actually simple. Use homebridge-myq only if physical access to the opener is impossible: you're renting, the unit is in a shared space you can't touch, or this is a temporary setup you're tearing down in weeks. For any permanent installation, ratgdo is the correct answer. And if you're already running Home Assistant, skip the Homebridge layer entirely: the ratgdo ESPHome integration surfaces natively in HA as a cover entity with lock, light, and obstruction sensor sub-entities, and from there HomeKit exposure is a matter of adding Home Assistant's built-in HomeKit Bridge integration and including the cover entity. Fewer processes, fewer config files, fewer things to break on an upstream update.

Gotchas That Will Cost You Time

The mDNS problem swallows more hours than any other issue in this stack. Homebridge uses HAP (HomeKit Accessory Protocol) discovery over mDNS, and Docker's default bridge network silently blocks multicast traffic. iOS never sees the accessory — the Home app just shows nothing. No error, no log entry, just absence. The fix is one line in your Compose file, but you won't find it in the Homebridge Docker README until you've already burned an afternoon:

services:
  homebridge:
    image: homebridge/homebridge:latest
    # host networking lets mDNS multicast reach your LAN interface
    # without this, iOS cannot discover the HAP bridge at all
    network_mode: host
    environment:
      - HOMEBRIDGE_CONFIG_UI_PORT=8581
    volumes:
      - ./homebridge:/homebridge
    restart: unless-stopped

If host networking is off the table for your setup (shared host, port conflicts), macvlan is the other viable path — it gives the container its own MAC and IP on your LAN, so mDNS behaves as if it's a physical device. Expect to spend 20 minutes configuring the macvlan parent interface correctly. Either way, never run Homebridge on the default bridge network and wonder why it won't appear.

The Security+ 1.0 vs 2.0 distinction matters enormously for what you can actually do. ratgdo communicates over the serial bus that Chamberlain introduced with Security+ 2.0, present on most openers from 2011 onward. If your opener is older — or a lower-end model that never got the serial bus — ratgdo falls back to dry-contact relay mode. That works for triggering open and close, but the opener has no way to report its current state back over a relay. Your HomeKit tile becomes write-only: you can tap it, but the position shown is whatever Homebridge last assumed, not ground truth. A reed sensor on the door itself fixes this, wired back to ratgdo's obstruction/sensor input or to a separate ESP32 running ESPHome. Without it, automations that check "is the garage closed?" will eventually lie to you.

If you're running homebridge-myq (the cloud-dependent plugin) rather than ratgdo, Chamberlain's rate limiter is a real operational hazard. Polling too frequently returns HTTP 423, which is not a network error or a timeout but a locked-out response, and the lockout lasts 15 to 30 minutes. During that window every door tile shows "No Response." Treat 30 seconds as the absolute floor for the polling interval and prefer 60 seconds, and enable the plugin's built-in post-command delay. In the homebridge-myq platform block of config.json (strict JSON, so no inline comments), eventDuration is the setting that keeps the plugin from re-polling immediately after you send an open or close command:

{
  "platform": "myQ",
  "options": {
    "polling": {
      "openDuration": 15,
      "closedDuration": 30,
      "eventDuration": 60
    }
  }
}
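If you script your own checks against a rate-limited service, the safe pattern is to stop retrying at the normal interval once a lockout response appears and wait at least as long as the lockout window. The function below is a generic Python sketch of that backoff; it is not how the plugin behaves internally, and the 15-minute floor and one-hour cap are assumptions taken from the lockout range described above.

```python
def next_delay(
    base_interval: float,
    consecutive_lockouts: int,
    lockout_floor: float = 15 * 60,  # shortest lockout mentioned in the text
    cap: float = 60 * 60,            # never wait longer than an hour (assumption)
) -> float:
    """Seconds to wait before the next poll, doubling after each lockout."""
    if consecutive_lockouts <= 0:
        return base_interval  # healthy: keep the normal polling cadence
    delay = lockout_floor * (2 ** (consecutive_lockouts - 1))
    return min(delay, cap)


if __name__ == "__main__":
    for lockouts in range(4):
        print(lockouts, next_delay(60, lockouts))
```

Resetting the lockout counter to zero after the first successful response returns the poller to its normal interval.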

The ratgdo firmware update trap is subtle specifically because it fails silently. When ratgdo renames an MQTT topic between releases, homebridge-mqttthing stops receiving state updates but logs nothing useful — the plugin is still subscribed, the broker is still running, the topic just no longer matches. The Home app displays "No Response" and nothing in the Homebridge logs points at the real cause. Pin your firmware version in the ESPHome YAML and treat ratgdo upgrades as a deliberate change requiring verification, not a routine update:

# esphome device YAML — pin the ratgdo component ref
external_components:
  - source:
      type: git
      url: https://github.com/ratgdo/esphome-ratgdo
      # pin to a specific commit hash, not 'main'
      # (3f8a2c1 is a placeholder; substitute a commit you have tested)
      ref: 3f8a2c1
    components: [ratgdo]

After any ratgdo firmware change, pull up your MQTT broker's topic tree with mosquitto_sub -h localhost -t '#' -v and confirm the topics your homebridge-mqttthing config expects are actually being published before you close the terminal. Thirty seconds of verification saves a debugging session that will happen at the worst possible time.
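That comparison can be scripted so it runs after every firmware change. The sketch below takes captured `mosquitto_sub -v` output and reports which expected topics never appeared; it assumes one "topic payload" pair per line and topic names without spaces, and the function names are made up for this example.

```python
def observed_topics(mosquitto_sub_output: str) -> set[str]:
    """Extract topic names from `mosquitto_sub -v` output."""
    topics = set()
    for line in mosquitto_sub_output.splitlines():
        line = line.strip()
        if line:
            # -v prints "topic payload"; the topic is everything before the first space
            topics.add(line.split(" ", 1)[0])
    return topics


def missing_topics(expected: list[str], mosquitto_sub_output: str) -> list[str]:
    """Return expected topics that were not seen, sorted for stable output."""
    return sorted(set(expected) - observed_topics(mosquitto_sub_output))


if __name__ == "__main__":
    captured = "ratgdo/garage/status/door open\nratgdo/garage/status/light off\n"
    print(missing_topics(["ratgdo/garage/status/door", "homebridge/garage/state"], captured))
```

Feed it a capture taken while you open and close the door once, since topics that only publish on change will not show up in an idle snapshot unless the broker retains them.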

Where This Fits in a Broader Self-Hosted Automation Stack

The most underrated benefit of getting this working locally isn't the HomeKit integration itself — it's that your garage door finally behaves like a real sensor in your stack. Once ratgdo is publishing state over MQTT and Homebridge is exposing a stable local accessory, HomeKit's native automation engine can treat door state as a first-class trigger. Arrival and departure automations via iPhone location work without phoning home to Chamberlain's servers. Time-of-day rules that close the door at 10 PM if it's been left open — those work during an internet outage. The reliability difference between a cloud-polled accessory and a local one becomes obvious the first time your ISP has a bad afternoon.

If you're already running an MQTT broker alongside n8n, the ratgdo state topic is a clean, push-based event source. Instead of a cron job hammering a cloud API every 60 seconds hoping the door state changed, you get an MQTT trigger node that fires exactly when the door opens or closes. A practical flow looks like this:

# ratgdo publishes to topics like:
ratgdo/garage/status/door   # → "open" | "closed" | "opening" | "closing"
ratgdo/garage/status/light  # → "on" | "off"
ratgdo/garage/status/motion # → "detected" | "clear"

# n8n MQTT Trigger node config:
# Broker: mqtt://192.168.1.x:1883
# Topic: ratgdo/garage/status/door
# Then: IF node checks payload === "open"
#       + a Wait node holds 10 minutes
#       + another MQTT node re-checks current state
#       + if still open AND current time after sunset → push notification via Pushover

That sunset condition is the part worth building properly. Polling "is it dark out" from a cron is awkward. In n8n you can pull sunset time from a simple weather API call at the start of the flow and store it in a variable, then compare against new Date() in a Code node (named Function in older n8n releases). The MQTT trigger eliminates the door-state polling entirely: the only HTTP calls in the whole flow are the sunset lookup and the outbound Pushover notification. For a broader look at how local event sources like this slot into multi-tool pipelines, the Workflow Automation in 2026: n8n, Zapier, and Self-Hosted Pipelines guide covers the architectural patterns in more depth.
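The decision at the end of that flow is small enough to write down explicitly. Here it is as a Python sketch rather than n8n node code, so the logic can be tested on its own; the function name and the ten-minute grace period mirror the flow described above, and the sunset time is assumed to be supplied by whatever lookup you use.

```python
from datetime import datetime, timedelta


def should_notify(
    opened_at: datetime,
    now: datetime,
    current_state: str,
    sunset: datetime,
    grace: timedelta = timedelta(minutes=10),
) -> bool:
    """Notify only if the door is still open after the grace period and it is past sunset."""
    still_open = current_state.strip().lower() == "open"
    waited_long_enough = (now - opened_at) >= grace
    after_sunset = now >= sunset
    return still_open and waited_long_enough and after_sunset


if __name__ == "__main__":
    opened = datetime(2026, 6, 19, 21, 0)
    sunset = datetime(2026, 6, 19, 20, 30)
    print(should_notify(opened, opened + timedelta(minutes=10), "open", sunset))
```

Re-reading the current state after the wait, rather than trusting the payload that started the flow, is what keeps a door that was closed in the meantime from triggering a false alert.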

The principle this whole setup demonstrates is worth generalizing. Any device where state lives in a vendor cloud is a liability — the MyQ situation made this visceral for a lot of people when Chamberlain started locking out third-party API access. The pattern that fixes it is consistent: find the local protocol (serial bus, Wiegand, Z-Wave, local HTTP), bridge it with cheap hardware (ratgdo, a Sonoff with custom firmware, an ESP32), publish to a local MQTT broker, and present it to HomeKit or whatever UI layer you need via a bridge like Homebridge. Smart locks that expose a Wiegand interface, older Ecobee thermostats with a local API, even some irrigation controllers — all of them respond to this same pattern. The ratgdo/Chamberlain case is just an unusually clean example because the hardware is purpose-built and the MQTT topic schema is well documented.

Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.