DEV Community: Guatu

Tesla P40 in a Homelab: 24GB of Inference on a Budget

Guatu — Mon, 25 May 2026 16:15:48 +0000

The Tesla P40 is a seductive piece of hardware: 24GB of VRAM for a fraction of the cost of a modern RTX card. But after three weeks of fighting with it, I realized that the "budget" part of the equation doesn't include the cost of my sanity. I spent more time debugging QEMU assertion errors and PCI address shifts than I did actually running models.

If you're looking to put a P40 in a Proxmox node to run LLMs, you're likely trying to fit larger models like Qwen2.5:32B into VRAM without spending four figures on an A100 or a 3090. It's a viable path, but the standard way of doing things (GPU passthrough to a VM) is a recipe for instability with this specific card.

The Passthrough Trap

My first instinct was to follow the standard Proxmox pattern: isolate the GPU using vfio-pci and pass it through to a dedicated Ubuntu VM. I've done this before, and usually, it's the right move for isolation. I had my IOMMU groups sorted and the hostpci line configured in the VM config.

It worked for about four hours. Then the P40 decided it didn't want to exist anymore.

The Tesla P40 lacks Function Level Reset (FLR). In a virtualized environment, this means that if the VM crashes or the driver hangs, the GPU doesn't actually reset. The next time you try to boot the VM, you get a QEMU assertion error or a "Device is already in use" message. I found myself hard-rebooting the entire physical node just to get the GPU to respond again. I've written about GPU passthrough gotchas before, but the P40 is particularly aggressive about breaking the happy path.

I also hit the PCI address instability issue. After a few reboots and some BIOS tweaks, the card shifted addresses, and my VM config became a lie. I was essentially playing a game of whack-a-mole with my hardware topology.

The Solution: Host-Level Inference

I stopped trying to be "architecturally clean" and decided to run the GPU directly on the Proxmox host. I know, running production-ish workloads on the hypervisor is usually a sin, but the P40 is too unstable in a VM to justify the overhead.

Here is exactly how I moved from a broken passthrough setup to a stable host-level inference engine.

1. Cleaning the Slate

First, I stripped the GPU out of the VM and killed the VFIO isolation. If you've already pinned your GPU to vfio-pci, you need to undo that.

# Remove the PCI device from the VM config
qm set <VM_ID> --hostpci0 ''

# Blacklist vfio to stop it from grabbing the card at boot
echo "blacklist vfio_pci" | sudo tee /etc/modprobe.d/vfio.conf
echo "blacklist vfio" | sudo tee -a /etc/modprobe.d/vfio.conf

# Update initramfs and reboot
update-initramfs -u
reboot

2. Host Driver Installation

I installed the NVIDIA 535 drivers directly on the Proxmox host. I chose 535 because it's stable with the P40's Pascal architecture.

sudo apt update
sudo apt install nvidia-driver-535
# Verify the card is seen and the driver is loaded
sudo nvidia-smi

3. Deploying Ollama as a Systemd Service

Instead of wrapping Ollama in a container on the host (which adds another layer of driver mapping pain), I deployed it as a systemd service. This ensures it starts on boot and has direct access to the GPU without runtime overhead.

I created a service file at /etc/systemd/system/ollama.service:

[Unit]
Description=Ollama
After=network.target

[Service]
User=ollama
Group=ollama
WorkingDirectory=/opt/ollama
ExecStart=/opt/ollama/ollama serve
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_KEEP_ALIVE=30s"
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

I set OLLAMA_HOST=0.0.0.0 so my other nodes in the cluster could hit the API, and OLLAMA_KEEP_ALIVE=30s to ensure the model unloads from VRAM quickly when not in use, leaving room for other tasks.

The VRAM Reality Check

With 24GB of VRAM, the P40 is a beast for its age, but it's not infinite. When I tried running Qwen2.5:32B, I noticed a massive performance drop as soon as the context window grew.

The issue isn't the model weights; it's the KV cache. If you allocate almost all 24GB to the model weights, there's no room left for the "memory" of the conversation. This leads to the model hallucinating or simply timing out.

To fix this, I had to use a more aggressive quantization (4-bit) and limit the context window. If you're running these models for AI agent orchestration, you need to be careful with the system prompts. A massive system prompt eats into your available VRAM before the first token is even generated.

Monitoring the Blind Spot

The biggest problem with running a GPU on the host is that you lose the visibility you get in a managed Kubernetes environment. nvidia-smi is great for a quick check, but it's useless for long-term stability monitoring.

I deployed nvidia_gpu_exporter as a DaemonSet on my Kubernetes cluster, but since the GPU is now on the host, I had to run the exporter as a standalone binary on the Proxmox node to feed metrics into my Prometheus instance.

If you're still using K8s for your GPU workloads, the standard NVIDIA device plugin isn't enough for real monitoring. You need the exporter to see things like temperature and power draw. For the P40, this is critical because it's a passive card. If your fans aren't dialed in, it will thermal throttle in seconds.

For those running the exporter in K8s, here is the manifest I use:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: nvidia-gpu-exporter
  template:
    metadata:
      labels:
        app: nvidia-gpu-exporter
    spec:
      containers:
      - name: exporter
        image: nvidia/gpu-exporter:latest
        ports:
        - containerPort: 9835
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"

Why This Actually Works

The reason the host-level approach wins is simple: it eliminates the translation layer. When you pass a GPU through, you're relying on the IOMMU and the hypervisor to handle memory mapping and interrupts. The P40's lack of FLR means that any failure in that chain is permanent until a cold boot.

By running on the host, the NVIDIA driver has a direct line to the hardware. If the driver crashes, you can often reload the kernel module without rebooting the entire machine. It's a trade-off: you lose the "clean" separation of a VM, but you gain a system that actually stays online.

Lessons Learned

If I had to do this again, I would have skipped the VM phase entirely. The documentation for Proxmox GPU passthrough is great for cards that support FLR, but it's misleading for older Tesla cards.

A few other things to watch out for:

Cooling is not optional. The P40 is designed for server chassis with high-static pressure fans. In a homelab case, you need a 3D-printed shroud and a high-RPM fan bolted directly to the heatsink. If the card hits 80C, your tokens-per-second will plummet.
Driver Mismatches. I hit a wall where nvidia-smi failed after a Proxmox kernel update. This usually happens when the kernel module is updated but the userspace libraries are out of sync. Always check your dkms status after a dist-upgrade.
VRAM is the only metric that matters. Don't get distracted by CUDA core counts. For inference, the 24GB VRAM is the only reason to buy this card. If you can afford a 3090, buy the 3090. The P40 is for those of us who want the most VRAM for the least amount of money and are willing to fight the OS to get it.

The P40 is a fantastic way to get into local LLMs, provided you're okay with treating your hypervisor as a workstation. It's not the "correct" way to build a cluster, but it's the way that actually works.

Longhorn Volume Health: The Gap Between 'Healthy' and Actually Working

Guatu — Mon, 25 May 2026 00:15:49 +0000

I once spent four hours debugging a PostgreSQL pod that was stuck in a crash loop with Input/output error across every single log line. I opened the Longhorn UI, and there it was: a bright green "Healthy" badge next to the volume. The replicas were synchronized, the nodes were up, and the dashboard insisted everything was perfect.

The reality was a stale mount on the worker node that had survived a pod migration, leaving the filesystem in a read-only state that Longhorn's control plane didn't care about.

If you're running stateful workloads on bare metal, you've probably already realized that Longhorn is great until it isn't. It simplifies distributed storage, but it introduces a layer of abstraction that can lie to you. You need to know the difference between "Control Plane Healthy" and "Data Plane Functional."

The Illusion of Health

In Longhorn, "Healthy" usually just means the replicas are in sync and the volume is attached to a node. It does not mean the application can actually write to the disk. I've hit this multiple times where the volume is technically healthy, but the pod is screaming because of permission mismatches or stale mounts.

The most common culprit is the mount layer. When a pod moves from Node A to Node B, Kubernetes expects the volume to detach and re-attach. Sometimes, the detach fails or the mount stays active on the old node. Longhorn might show the volume as attached to the new node, but the OS on the worker is still holding onto a ghost mount.

If you see I/O error in your logs but the UI is green, stop looking at the UI. You need to check the actual mount point on the worker node.

Solving the Stale Mount Trap

When a volume gets stuck, the "happy path" is to let Kubernetes handle the detachment. In reality, you often have to force the issue.

The first thing I try is scaling the deployment to zero. This forces Kubernetes to send the detach signal to the CSI driver.

# Scale to 0 to break the lock and force volume detach
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-db
spec:
  replicas: 0

If that doesn't work, you have to go into the worker node via SSH. I've found that manually unmounting the path usually clears the deadlock. Be careful here: if you unmount a volume that is actually being written to, you're asking for filesystem corruption.

# Check for stale mounts on the worker node
mount | grep longhorn

# If you find a mount that shouldn't be there
umount -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-xxxx-xxxx/mounted

The -l (lazy) unmount is the secret here. It detaches the filesystem from the hierarchy immediately, even if the resource is busy, and cleans up the references once the resource is no longer in use.

The Capacity Lie: Snapshot Bloat

Capacity management in Longhorn is where most people run into their first "production" outage. You set up a 100GB PVC, and a few months later, your node disks are at 95% capacity even though your application is only using 20GB of data.

This is snapshot bloat. Longhorn snapshots are incremental, but if you have a high-churn database (like Postgres or MariaDB) and a reckless snapshot schedule, those increments add up.

I learned this the hard way when I set up an hourly snapshot policy without a strict retention limit. The snapshots were accumulating on detached volumes that I had forgotten to delete. Longhorn doesn't automatically purge snapshots for volumes that aren't currently attached to a pod unless you explicitly tell it to.

To fix this, I adjusted my SnapshotSchedule to exclude detached volumes. This prevents the system from wasting IO and space on volumes that aren't even active.

apiVersion: longhorn.io/v1beta1
kind: SnapshotSchedule
metadata:
  name: daily-backup-critical
spec:
  schedule: "0 2 * * *" # 2 AM daily
  retention: 7           # Keep only 7 days
  excludeDetachedVolumes: true # Stop snapshotting dead volumes

If you're already in a capacity crisis, don't just delete PVCs. Check for orphaned replicas. Sometimes a PVC is deleted from K8s, but the Longhorn volume remains in the UI as "detached." These are ghosts eating your disk space. Purge them manually from the UI or via the API.

Permissions and the SecurityContext Gap

Another "health" issue that doesn't show up in monitoring is the Permission denied error. Longhorn mounts volumes as root by default. If you're running a container as a non-root user (which you should be), the application will fail to write to the volume immediately upon startup.

I ran into this with an n8n deployment. The pod was "Running," the volume was "Healthy," but the logs were a wall of permission errors.

The fix isn't to chmod 777 the volume (don't do that). The fix is using the fsGroup in the securityContext. This tells Kubernetes to change the ownership of the volume to a specific GID when it's mounted.

spec:
  securityContext:
    fsGroup: 1000 # Matches the UID/GID of the application user
  containers:
    - name: n8n-app
      image: n8nio/n8n:latest
      # ... rest of config

For databases, I also recommend being explicit about the data directory. Some images default to a path that might conflict with how the volume is mounted. I always override the data path to a sub-directory to avoid issues with the lost+found folder that Linux creates on the root of the volume.

env:
  - name: PGDATA
    value: "/var/lib/postgresql/data/pgdata"

Monitoring That Actually Matters

If you want to stop guessing, you need to move beyond the Longhorn UI. I use Prometheus and Grafana to track the actual replication state.

The metric I watch most closely is longhorn_volume_replica_state. If a replica moves from healthy to degraded or faulted, I want an alert before the application notices.

One specific thing to watch for is the "Replica Count" vs "Healthy Replica Count." If you have 3 replicas but only 2 are healthy, you're one disk failure away from a total outage. This is a silent killer because the volume will still report as "Healthy" in the UI as long as one replica is available.

I've integrated these alerts into my general infrastructure monitoring. If you're managing this at scale, I highly recommend looking into predictive maintenance consulting to set up these thresholds before you hit a "disk full" panic at 3 AM.

Gotchas and Tradeoffs

I've considered using Rook-Ceph for larger workloads, and while it's more powerful, it's a nightmare to manage in a small cluster. Longhorn is the right choice for most homelabs and small production setups, but you have to accept the tradeoffs:

CPU Overhead: Longhorn runs a manager pod for every volume. If you have 100 small volumes, your CPU usage will spike just from the management overhead.
Disk Pressure: Longhorn doesn't have a native "thin provisioning" that's as transparent as some enterprise arrays. You need to monitor the actual node disk usage, not just the PVC usage.
PDB Conflicts: If you have strict Pod Disruption Budgets (PDBs), you might find that kubectl drain hangs forever because Longhorn is struggling to move a volume. I've written about this in Pod Disruption Budgets: Why kubectl drain Gets Stuck on Longhorn.

Lessons Learned

The biggest takeaway from managing Longhorn on bare metal is that the storage layer is not a "set and forget" component.

If you're building out your storage, start with a solid foundation. I've detailed the initial setup in Kubernetes Storage on Bare Metal: Longhorn in Practice, but the operational side is where the real work is.

Always assume the UI is lying to you. When a pod fails, check the logs for I/O errors first, then check the worker node mounts, and only then trust the green checkmark in the Longhorn dashboard. Use fsGroup for every stateful app, set strict retention on your snapshots, and for the love of your sanity, exclude detached volumes from your backup schedules.

Condition-Based vs Time-Based Maintenance: Making the Switch

Guatu — Fri, 22 May 2026 16:15:49 +0000

I spent a weekend reviewing a maintenance log for a conveyor system that was costing thousands in "preventative" parts replacements every quarter, only to find that the technicians were throwing away bearings that had 60% of their life left. At the same time, a motor had burned out three weeks before its scheduled service because it had been running hot for a month, but the calendar said it wasn't time to check it yet.

Time-based maintenance is a gamble where you bet that the average failure rate of a component matches the actual failure rate of your specific machine. In the real world, that bet usually loses.

If you're managing industrial assets, you've likely lived through this. You either over-maintain, wasting money and introducing "infant mortality" failures by disturbing a working system, or you under-maintain and deal with unplanned downtime. The move to Condition-Based Maintenance (CBM) is the only way out, but the gap between the theory of "predictive maintenance" and a working system on the factory floor is massive.

What I tried first

My first attempt at CBM was naive. I thought I could just slap a few sensors on the equipment, pipe the data into a dashboard, and let the operators decide when to perform maintenance. I set up a basic MQTT pipeline using Mosquitto (which I've written about before regarding broker selection) and pushed raw vibration and temperature data to a Grafana dashboard.

It failed miserably.

First, I created a "noise apocalypse." I had alerts firing every time a sensor spiked for a millisecond due to electrical noise. The operators started ignoring the alerts entirely. Second, I didn't define what "bad" actually looked like. I was giving them raw data, not actionable intelligence. An operator doesn't care if a motor is at 72 degrees Celsius; they care if 72 degrees is a 10% increase over the baseline for that specific load.

I also tried to automate the ticketing system using simple cron jobs that checked for thresholds every hour. This led to a flood of "HEARTBEAT_OK" messages in the logs and redundant tickets. I was basically just building a more expensive version of a time-based system, just with different triggers.

The actual solution

The shift happens when you stop treating sensors as "alarms" and start treating them as "state providers." You need a pipeline that filters noise, establishes a baseline, and triggers actions based on deviations rather than arbitrary numbers.

1. Filtering the Noise

Instead of raw thresholds, I implemented a sliding window average. If you're using Python for your edge processing, don't just trigger on val > threshold. Use a buffer.

import collections

class SensorMonitor:
    def __init__(self, threshold, window_size=10):
        self.threshold = threshold
        self.window = collections.deque(maxlen=window_size)

    def is_anomaly(self, current_value):
        self.window.append(current_value)
        if len(self.window) < self.window.maxlen:
            return False

        # Calculate moving average to ignore transient spikes
        avg = sum(self.window) / len(self.window)
        return avg > self.threshold

# Example: Triggering maintenance only if the average 
# vibration stays high over 10 readings
monitor = SensorMonitor(threshold=10.5) 
if monitor.is_anomaly(current_vibration):
    trigger_maintenance_alert("Sustained high vibration detected")

2. Condition-Based Escalation Rules

Once the data is clean, you can't just send an email. You need escalation logic that understands the context. I moved away from simple cron-based alerts to a condition-based rule engine. This is similar to how I handle equipment health scoring, where we consolidate multiple signals into one status.

Here is how I structured the escalation logic in the configuration:

# Condition-based escalation rules for maintenance tickets
escalation_rules:
  - condition: "sensor.vibration > 12.0 AND asset.criticality == 'high'"
    action: "immediate_dispatch"
    priority: 1
  - condition: "sensor.temp_deviation > 15% AND ticket.age > 4h"
    action: "notify_maintenance_lead"
    priority: 2
  - condition: "sensor.vibration > 8.0 AND ticket.age > 24h"
    action: "schedule_inspection_next_shift"
    priority: 3

3. Optimizing the Alerting Pipeline

To stop the "alert fatigue" I mentioned earlier, I overhauled the cron jobs that monitored the system health. I stopped the unconditional "Everything is OK" messages and moved to a "silent success" model.

# Optimized payload for condition-based alerting
# Only sends a notification if the status is not 'success'
payload:
  message: "Asset {{ asset_id }} monitoring failed with status: {{ status }}"
  condition: "status != 'success'"
  # The system remains silent if the condition is met (success)
  reply: "All assets healthy" if status == 'success'

Why it works

Time-based maintenance assumes a linear degradation of parts. In reality, degradation is stochastic. A bearing might last 10,000 hours or 100 hours depending on the lubrication quality and the load it carries.

By moving to CBM, you're monitoring the actual degradation. When you track vibration (using the architecture I've detailed here), you're seeing the physical manifestation of wear (pitting, spalling, or misalignment) long before the part actually fails.

The logic of using a sliding window and deviation-based thresholds works because it separates the signal from the noise. In an industrial environment, electrical interference is a constant. A single high reading is usually a fluke; a sustained increase in the moving average is a mechanical reality.

also, the condition-based escalation rules prevent the "crying wolf" effect. By tying the action to both the sensor value and the asset's criticality, you ensure that the maintenance team only drops what they're doing when it actually matters.

Lessons learned

The biggest surprise was that the hardware wasn't the hard part. Getting the sensors to talk via MQTT is trivial. The hard part is the cultural shift. Operators who have spent twenty years changing oil every three months don't trust a dashboard telling them they can wait another two months.

If I did this again, I'd start with a "shadow period." I would run the CBM system in parallel with the time-based schedule for six months. I'd log every time the CBM system predicted a failure and every time the time-based schedule replaced a perfectly good part. Having that data is the only way to convince a skeptical plant manager to change the schedule.

A few other caveats:

Sensor Drift: Sensors fail too. If you rely solely on CBM, a failing sensor can look like a failing motor. You still need a basic time-based schedule for sensor calibration.
The "Silent" Trap: When you move to a "silent success" alerting model, you run the risk of not knowing if your monitoring system has died. I fixed this by implementing a dead-man's switch (heartbeat) that alerts if the monitoring service itself stops reporting.
Data Overload: Don't try to monitor everything. Pick the top 20% of assets that cause 80% of your downtime. Trying to implement CBM on every single small fan in the building is a waste of engineering hours.

For those looking to implement this at scale, you can check my services page for consulting on predictive maintenance and IIoT infrastructure. Moving from a calendar to a condition is a steep climb, but it's the only way to stop wasting money on parts that aren't broken.

Proxmox Cluster Quorum: How Many Nodes Do You Actually Need

Guatu — Mon, 18 May 2026 16:15:57 +0000

I woke up to a cluster that had effectively turned itself into a read-only museum. My VMs were running, but I couldn't start a new one, I couldn't migrate a workload, and the Proxmox GUI was throwing "Cluster not ready - no quorum" errors across the board. I had a two-node setup, one node had rebooted for a kernel update, and the remaining node decided that since it didn't have a majority, it no longer had the right to make decisions.

If you're building a Proxmox cluster, quorum is the one concept that will either be completely invisible or the primary reason your entire infrastructure freezes. Most people treat it as a checkbox during the cluster creation wizard, but in a home lab, the math of quorum often clashes with the reality of how many physical servers you can actually fit in your rack.

What I tried first

My initial instinct was that "Cluster" simply meant "nodes that can talk to each other." I assumed that as long as one node was alive, the cluster was alive. I set up two beefy nodes, linked them together, and felt confident.

Then I hit the "split-brain" wall. In a two-node cluster, the quorum requirement is (n/2) + 1. For two nodes, that means you need two votes to have a majority. If one node goes down, the remaining node has one vote. One is not greater than one. The remaining node loses quorum and enters a protective state. It stops allowing configuration changes to prevent a scenario where both nodes think they are the master and start writing conflicting data to shared storage, which is a great way to corrupt your VM disks.

I tried to "fix" this by manually forcing quorum on the surviving node using pvecm expected 1. It worked for a few minutes, but it's a manual band-aid. Every time a node rebooted or a network cable acted up, I was back in the CLI fighting with the cluster manager. I realized I was fighting the fundamental design of Corosync, and the only way out was to change the voting math.

The actual solution

You have three real options depending on your hardware budget and your tolerance for manual intervention.

Option 1: The Three-Node Standard

The cleanest way to solve quorum is to just add a third node. With three nodes, quorum is two votes. If one node dies, two remain. You still have a majority, and HA (High Availability) actually works as intended.

Option 2: The QDevice (The "Cheap" Vote)

If you can't justify a third full-sized server, you use a Quorum Device (QDevice). A QDevice is a lightweight external voter. It doesn't run VMs; it just tells the cluster "Yes, I see Node A." You can run this on a Raspberry Pi, a tiny VM on a separate host, or even a cheap VPS.

To set up a QDevice on a separate Debian/Ubuntu machine:

# On the QDevice server (the voter)
apt update && apt install corosync-qnetd

# On all Proxmox nodes
apt update && apt install corosync-qdevice

Once the software is installed, you initialize the device from one of the Proxmox nodes:

# Run this on one PVE node
pvecm qdevice setup <IP-OF-QDEVICE-SERVER>

This adds a third vote to the cluster without requiring a third Proxmox node. Now, if one PVE node fails, the other PVE node and the QDevice provide the two votes needed to maintain quorum.

Option 3: Monitoring and API Integration

If you're running a larger setup, you shouldn't be checking quorum by clicking through the GUI. I integrated pve_exporter with Prometheus to get alerts the second a node loses its vote.

Since I'm using token-based authentication to avoid the security risks of root passwords in plain text (see my post on Proxmox API Tokens), the setup looks like this.

First, create a restricted user for the exporter:

# Create user with PVEAuditor role
pveum user add prometheus@pve --realm local --password sEcr3T! --groups PVEAuditors

# Create API token for prometheus@pve
pveum token add prometheus@pve prometheus --privsep 0

Then, configure the pve_exporter YAML:

api:
  token_name: prometheus
  token_value: prometheus@pve!prometheus

And the Prometheus scrape config to target the nodes:

- job_name: 'proxmox'
  metrics_path: /pve
  scrape_interval: 30s
  params:
    cluster: ['1']
    node: ['1']
  relabel_configs:
    - source_labels: [__address__]
      regex: '^(10\.0\.0\.\d+)$'
      target_label: __param_target
      replacement: $1
  static_configs:
    - targets: ['10.0.0.x:9221']

Why it works

Proxmox uses Corosync for cluster membership and quorum. Corosync is designed for absolute consistency over availability (the "C" in the CAP theorem). It assumes that if you can't reach a majority of your peers, you are the one who is isolated, not them.

In a two-node cluster, there is no way to distinguish between "Node B is dead" and "The network cable between Node A and Node B is unplugged." If Node A decided to stay "active" while Node B also stayed "active," and both tried to modify the same shared storage (like a Ceph pool or an NFS share), you'd end up with a corrupted filesystem.

By adding a third vote (either a node or a QDevice), you break the tie. The node that can still talk to the QDevice knows it is part of the majority. The node that is isolated knows it's alone and gracefully steps back.

Lessons learned

The biggest lesson here is that High Availability (HA) is a lie if you don't have a proper quorum strategy. I spent a week thinking I had "HA" because I had two nodes and shared storage. In reality, I had a system that would freeze the moment I tried to update a BIOS or swap a NIC.

If you're running a two-node cluster, do not rely on pvecm expected 1. It's a temporary fix for recovery, not a configuration. Get a QDevice. Even a $35 Raspberry Pi is better than a cluster that goes read-only during a midnight update.

I also found that hardware stability plays a huge role in quorum health. If you're seeing random "Node lost" messages in your logs but the server is still pingable, check your kernel settings. I've dealt with AMD Ryzen C-State freezes that looked like network failures but were actually the CPU dropping into a sleep state so deep the NIC stopped responding for a few milliseconds, triggering a Corosync timeout.

A few final caveats:

QDevice Placement: Don't run your QDevice as a VM on the same cluster it's voting for. That's circular logic. If the cluster loses quorum and the VM stops, the QDevice disappears, and you're stuck. Put it on a separate physical box or a different hypervisor.
Network Latency: Corosync is extremely sensitive to latency. If you're putting your QDevice in the cloud or on a slow Wi-Fi link, you'll see "flapping" where the cluster constantly gains and loses quorum. Use a wired connection.
The "Expected" Trap: When you manually change pvecm expected, you are telling the cluster to ignore the safety rules. Only do this when you are performing maintenance on a known-down node and need to regain control of the surviving one.

If you're scaling this into a production-grade environment, this is where the gap between a "homelab" and "infrastructure" becomes clear. For those needing professional help architecting these systems for zero-downtime, I provide infrastructure consulting to handle the messy parts of bare-metal orchestration.

Kyverno Admission Controllers: Policy-as-Code That Actually Works

Guatu — Mon, 18 May 2026 02:15:57 +0000

I spent an entire Saturday afternoon debugging why my CloudNativePG (CNPG) database cluster refused to initialize, only to find out my own security policies were killing the initdb jobs. I had a "require-resource-limits" policy active across the cluster. It sounded like a great idea: no pod enters the cluster without explicit CPU and memory limits. The documentation makes this look like a five-minute win for cluster stability.

What the docs don't tell you is that many Kubernetes Operators, including CNPG, spawn temporary Jobs or Pods that don't always inherit the limits you've defined in the primary custom resource. The admission controller saw a pod without limits, deemed it "illegal," and blocked it. The operator just kept retrying, and I kept wondering why my database was stuck in a pending state with no obvious error in the operator logs.

This is the gap between "Policy-as-Code" as a concept and Policy-as-Code in a real production environment. If you've ever tried to enforce standards across a multi-node cluster, you've probably looked at OPA Gatekeeper or Kyverno. I've used both. One requires you to learn a specialized language (Rego) that feels like a full-time job, and the other uses YAML.

Why you'd choose a Policy Engine

You reach this decision point when your cluster grows beyond a few hand-rolled manifests. Once you're using ArgoCD to scale your apps, you stop caring about individual pods and start caring about invariants.

These invariants usually fall into a few buckets:

No one runs a container as root.
Every deployment has a specific set of labels for monitoring.
Resource limits are enforced so one runaway AI agent doesn't starve the rest of the node.
Sidecars are automatically injected without manually editing every deployment.

You can do some of this with Pod Security Admissions (PSA), but PSA is a blunt instrument. It's a "yes or no" switch. A real admission controller allows you to mutate the request on the fly. If a developer forgets a security context, the controller doesn't just reject the pod; it injects the correct one.

Option A: OPA Gatekeeper

Gatekeeper is the industry standard for large-scale enterprises. It's built on Open Policy Agent (OPA), and its primary strength is its absolute precision.

Strengths
The logic is decoupled from the Kubernetes API. Because it uses Rego, you can write incredibly complex queries. If you need a policy that says "Allow this pod only if the user is in the 'dev' group AND the time is between 9 AM and 5 PM AND the image comes from a specific signed registry," Gatekeeper can do it.

Weaknesses
The learning curve is a cliff. Rego is a declarative query language, and if you've never used it, you'll spend more time fighting the syntax than actually securing your cluster. Debugging a failing Rego policy is a nightmare because the error messages are often opaque.

When it shines
Gatekeeper is for environments where compliance is a legal requirement. If you're in a highly regulated industry where you need a mathematical proof of your security posture, the overhead of Rego is worth it.

Option B: Kyverno

Kyverno is the choice for those of us who just want things to work without learning a new language. It uses YAML for everything.

Strengths
It's native to Kubernetes. If you can write a Pod manifest, you can write a Kyverno policy. It handles mutation, validation, and generation. The "generation" part is a killer feature: you can tell Kyverno that whenever a new namespace is created, it should automatically generate a NetworkPolicy and a LimitRange for that namespace.

Weaknesses
YAML has limits. While Kyverno is powerful, it can't match the raw computational logic of Rego for extremely complex edge cases. It's also easier to accidentally create "mutation loops" where a policy changes a resource, which triggers the policy again, ad infinitum.

When it shines
It's perfect for the GitOps-driven homelab or mid-sized production environment. It integrates cleanly with manifest validation pipelines and doesn't require a dedicated "Policy Engineer" to maintain.

Decision Framework

Criterion	OPA Gatekeeper	Kyverno
Language	Rego (Specialized)	YAML (K8s Native)
Learning Curve	Steep	Shallow
Mutation	Possible, but complex	First-class citizen
Resource Generation	No	Yes
Performance	Extremely high	High
Configuration	ConstraintTemplates	ClusterPolicies
Ideal User	Compliance/Security Teams	DevOps/Platform Engineers

My Pick and Why

I use Kyverno. I've tried the "right way" with OPA, but in a lean environment, the cognitive load of Rego is a liability. I'd rather spend my time optimizing my AI agent orchestration than debugging a query language.

However, using Kyverno without a strategy is a fast track to a broken cluster. To make it actually work, you have to move away from the "happy path" and account for infrastructure overhead.

The "Infrastructure Exclusion" Pattern

The biggest mistake I made early on was applying policies globally. I had a policy that required all pods to have a specific security context. Suddenly, my Traefik ingress and ArgoCD controllers started crashing because they needed specific capabilities (like NET_ADMIN) that my policy explicitly forbade.

The fix is to implement a strict exclusion list. You cannot treat your infrastructure components the same way you treat your application workloads. I now use a combination of namespace exclusions and label-based filters to ensure that the "plumbing" of the cluster stays functional.

Here is how I handled the CNPG issue. Instead of a blanket "require limits" policy that blocks everything, I added an exclusion for any resource tagged by the CNPG operator.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  rules:
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      generate:
        kind: LimitRange
        name: default-limit-range
        namespace: $(metadata.namespace)
        applyTo: Pod
        spec:
          limits:
            - type: Container
              max:
                memory: 512Mi
      exclude:
        any:
          - labels:
              cnpg.io/cluster: "*"

This policy ensures that most pods get a default limit range, but it stays out of the way of the database operator's internal jobs.

Handling Security Contexts without Breaking the Cluster

Another common pitfall is forcing security contexts on pods that actually need to run as root to perform system-level tasks. I've seen this happen with storage drivers and network plugins.

I prefer a "mutate-then-validate" approach. I use Kyverno to inject a sane default security context for everything, and then I create a small set of exceptions for the system namespaces.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-security-context
spec:
  rules:
    - name: set-default-security-context
      match:
        any:
          - resources:
              kinds:
                - Pod
      # I use mutate here instead of generate to ensure the pod 
      # spec is modified before it hits the scheduler
      mutate:
        patchStrategicMerge:
          spec:
            securityContext:
              runAsUser: 1000
              runAsGroup: 1000
              fsGroup: 2000
              supplementalGroups: [2001]

If you apply this globally, you'll likely break your CNI or your CSI driver. You must exclude kube-system and any namespace where you've deployed low-level infrastructure.

The Danger of `synchronize: true`

Kyverno has a setting called synchronize. When set to true, Kyverno will automatically update the generated resource if the policy changes. This sounds great in theory, but in practice, it can create a synchronization nightmare.

I once had a policy generating NetworkPolicies for every new namespace. I changed the policy to add a new rule, and Kyverno attempted to update every single NetworkPolicy in the cluster simultaneously. This caused a spike in API server latency and, for a few minutes, left some of my internal services unreachable because the policies were in a state of flux.

My rule of thumb now is to avoid synchronize: true for high-churn resources. If you need to update a generated resource across the cluster, it's safer to trigger a rolling update via your GitOps pipeline than to let the admission controller try to rewrite the cluster state on the fly.

Orphaned Resources and the Cleanup Gap

Policy engines are great at creating things, but they're often bad at cleaning them up. I ran into this with a dashboard app called Homarr. I had a policy that generated certain config maps for the dashboard. When I deleted the application via the API, the generated resources stayed behind.

This led to "phantom" items appearing in my dashboard UI. The application was gone, but the configuration lived on in the etcd store. Kyverno doesn't always track the lifecycle of generated resources perfectly.

If you find yourself with orphaned records in your database or config stores, you might have to go in manually. For Homarr, I had to run a few SQL queries to purge the dead references:

-- Clean up orphaned item_layout and item records
DELETE FROM item_layout WHERE itemId NOT IN (SELECT id FROM item);
DELETE FROM item WHERE app_id NOT IN (SELECT id FROM app);

It's a reminder that while "Policy-as-Code" automates the deployment, it doesn't always automate the decommissioning.

Integration with the Wider Stack

A policy engine shouldn't exist in a vacuum. I've found that the most stable setups link Kyverno with other infrastructure tools. For example, I use it to ensure that any ingress resource created in the cluster has the correct annotations for cert-manager and Cloudflare DNS-01.

Instead of reminding every developer to add the cert-manager.io/cluster-issuer annotation, I wrote a mutation policy that adds it automatically if the ingress is in a production namespace. This removes the human element from the TLS chain.

Similarly, I use Kyverno to enforce that all SealedSecrets are tagged with an owner label. This makes it significantly easier to track who owns which secret when I'm auditing the cluster for old, unused credentials.

Lessons Learned

The biggest takeaway from my time with admission controllers is that the "happy path" is a lie. The documentation shows you how to block a pod, but it doesn't show you the three hours of debugging you'll do when a system-critical operator gets blocked by that same policy.

I've learned to follow three strict rules:

Test in a sandbox. Never apply a new ClusterPolicy to a production cluster without running it in audit mode first. Kyverno's audit mode lets you see what would have been blocked without actually blocking it.
Exclude the plumbing. Your infrastructure (Traefik, ArgoCD, CNPG, etc.) should almost always be exempt from general application policies.
Keep it simple. If a policy requires more than a few lines of complex YAML logic, it's probably time to ask if that constraint should be handled at the CI/CD level rather than the admission level.

I've moved toward using manifest validation in CI to catch the obvious errors before they ever hit the API server. This reduces the load on the admission controller and provides faster feedback to the person writing the YAML.

If you're building out your own infrastructure and need help designing a secure, automated pipeline for AI agents or industrial systems, you can check out my services. I focus on the gap between the documentation and the actual production reality, which is usually where the most expensive bugs live.

Privacy-Routed LLM Inference: Keeping Sensitive Data Out of the Cloud

Guatu — Fri, 15 May 2026 16:15:32 +0000

I spent three hours debugging a "hallucination" in my agent's daily briefing only to realize the agent wasn't hallucinating at all. It had simply failed to access my local financial spreadsheets because of a tool denylist I'd configured for security, and instead of admitting it couldn't see the data, it had tried to "guess" based on a few fragments it had previously cached in a cloud-based session. Even worse, I discovered that a fallback trigger in my orchestration layer had sent a summarized snippet of my private data to a cloud API because the local inference node had a momentary timeout.

If you're building AI agents that touch real-world data, the "happy path" is usually just a prompt and an API key. The reality is a minefield of data leaks, prompt injections, and silent failures that send your private keys or bank statements to a third-party server because a local GPU pod decided to restart.

This is a problem for anyone running autonomous agents that have read or write access to a local filesystem. If your routing logic is flawed, your privacy isn't a policy; it's a coin flip.

The Wrong Way: Trusting the Orchestrator

My first attempt at "privacy" was naive. I used a simple conditional in my agent's logic: if the query contained words like "bank," "password," or "private," route it to a local Ollama instance. Otherwise, send it to GPT-4o.

This failed immediately for three reasons. First, keyword filtering is a joke. A user (or a prompt injection) can easily bypass "bank" by asking about "financial liquidity instruments." Second, I assumed the orchestrator was a neutral party. In reality, the orchestrator often handles the context window, meaning the sensitive data is already in the prompt before the routing decision is even made. Third, I had no fail-safe. When the local model timed out, the system defaulted to the cloud provider to ensure "high availability." In a privacy-first system, unavailability is better than exposure.

I also hit a wall with tool access. I had disabled sandbox.mode to let my agents actually do work, but I quickly found that built-in tools like read and edit can be manipulated to bypass exec allowlists. I saw a specific instance where a prompt injection convinced the agent to use a read-chunk command (a hidden utility in some knowledge base scripts) to dump raw data from a file that should have been summarized first.

The Actual Solution: Two-Tier Privacy Routing

The only way to actually guarantee privacy is to move the routing logic as close to the data as possible and treat the cloud LLM as an untrusted guest. I implemented a two-tier architecture: a local "Privacy Gate" and a reference-only knowledge base.

1. The Reference-Only Knowledge Base

Instead of feeding raw files to the LLM, I use a system where the LLM never sees the original document. I use poppler-utils for PDF extraction and a local embedding model to populate a Qdrant vector store. The agent queries the vector store, but the results are filtered through a local script before being sent to any inference engine.

2. The Privacy Gate (Routing Layer)

I wrote a wrapper, knowledge.sh, that handles the routing. It doesn't rely on keywords. It relies on the data source. If the data comes from a "Sensitive" tagged volume in my cluster, the request is hard-pinned to the local GPU node.

Here is a simplified version of how I handle a private query:

#!/bin/bash
# knowledge.sh query - Local-first routing

QUERY=$1
MODEL="qwen2.5:14b"
# The local endpoint is a dedicated GPU node in my K8s cluster
LOCAL_ENDPOINT="http://ollama-gpu-node.internal/v1/chat/completions"

# Check if the query requires sensitive data access
if [[ "$QUERY" == *"--private"* ]]; then
    echo "Routing to local inference..."
    # We use a local model and a local endpoint. No cloud fallback.
    curl -X POST "$LOCAL_ENDPOINT" \
         -H "Content-Type: application/json" \
         -d "{
           \"model\": \"$MODEL\",
           \"messages\": [{\"role\": \"user\", \"content\": \"$QUERY\"}],
           \"stream\": false
         }"
else
    # Non-sensitive queries can go to the cloud orchestrator
    ./route-to-cloud.sh "$QUERY"
fi

3. Hardening the Execution

To prevent the "hallucination via missing data" problem, I stopped letting the LLM handle the final delivery of sensitive reports. I use a pattern where the LLM generates a template or a summary, but a local Python script handles the actual data insertion and delivery.

For my daily briefings, I use a wrapper script that ensures the data collection is isolated from the cloud inference:

#!/bin/bash
# life-briefing-run.sh

# 1. Collect raw data locally (Private)
./daily-briefing.sh --collect-only

# 2. Format the data using a local script (No LLM involved here)
# This prevents the LLM from accidentally leaking raw data in its output
python3 /opt/scripts/format-and-send-briefing.py

And the Python script handles the delivery via a secure API (like Telegram) without ever sending the raw content to a third-party LLM for "polishing":

import json
import requests

def send_telegram_message(message):
    # Tokens are managed via SealedSecrets in K8s
    bot_token = 'ANONYMIZED_TOKEN'
    chat_id = 'ANONYMIZED_ID'
    url = f'https://api.telegram.org/bot{bot_token}/sendMessage'
    payload = {
        'chat_id': chat_id,
        'text': message,
        'parse_mode': 'Markdown'
    }
    requests.post(url, json=payload)

# Load the locally generated briefing
with open('/tmp/briefing.txt', 'r') as f:
    content = f.read()
    send_telegram_message(content)

Why This Works

This approach works because it removes the "decision" from the LLM. If you ask an LLM "Should I send this to the cloud?", it will eventually say yes. By moving the routing to a bash wrapper and a Python script, the logic is deterministic.

The use of a local model like qwen2.5:14b via Ollama provides enough reasoning capability to summarize private data without needing the massive parameter counts of GPT-4. I've found that for most RAG (Retrieval-Augmented Generation) tasks, a 14B model is the sweet spot between performance and the VRAM limits of my GPU nodes.

By separating the synthesis (LLM) from the delivery (Python script), I've created a circuit breaker. Even if the LLM is compromised via prompt injection, it cannot "leak" the data to the cloud because it doesn't have the API keys for the cloud provider; those are held by the orchestrator, which is gated by the knowledge.sh script.

For those managing the underlying hardware, ensuring these local models stay performant requires a stable infrastructure. I've written about how I handle GPU passthrough on Proxmox and why the NVIDIA Container Toolkit is non-negotiable for this to work in a Kubernetes environment.

Lessons Learned

The biggest surprise was how often "convenience" features in agent frameworks are actually security holes. For example, I found that sessionKey in some cron-job implementations is often misunderstood. I assumed it provided hard isolation, but it turns out it's often just a routing hint. To get actual isolation, you have to explicitly set the session to isolated, or you risk your private data bleeding into the "main" session context, which might be shared with a cloud-connected agent.

Another gotcha was the Qdrant MCP. I hit several "Not existing vector name" errors during the rollout. This wasn't a bug in my code but a version mismatch between the MCP server and the Qdrant instance. In a bare-metal K8s setup, pinning your versions is the only way to avoid waking up to a broken pipeline.

If I were to do this again, I'd implement a more formal "Taint and Toleration" system in Kubernetes. I'd taint my GPU nodes with privacy=high and only allow pods with the corresponding toleration to run there. This would prevent a non-private, cloud-connected pod from ever being scheduled on the same physical hardware where my sensitive local models are processing data in memory.

For those looking to scale this into a professional environment, this kind of architecture is a core part of what I do in AI agent and infrastructure consulting. Moving from a "it works on my machine" script to a production-grade, privacy-routed pipeline is where most of the complexity lives.

The takeaway is simple: if the data is sensitive, the cloud is a liability. Build your gate, pin your models, and never let your LLM decide where your data goes.

Tailscale Subnet Routers: Accessing Your LAN Without the VPN Headache

Guatu — Thu, 14 May 2026 02:15:07 +0000

I spent three hours trying to SSH into a legacy industrial gateway from a coffee shop, only to realize I'd forgotten to install the Tailscale agent on that specific piece of hardware. The device was a locked-down firmware image where "installing a binary" isn't an option. That's the moment I stopped trying to put Tailscale on every single node and instead shifted to a dedicated subnet router.

If you have a multi-node Proxmox cluster, a rack of IoT sensors, or a bunch of "dumb" switches and PDUs, you can't possibly install a client on everything. You need a way to tell your Tailnet: "If you're looking for anything in the 10.0.0.x range, just send the traffic to this specific Linux box, and it'll handle the rest."

The Concept: Routing vs. Agent-based Access

Standard Tailscale is a mesh. Every device is a peer. This is great for your laptop and your primary workstation, but it's a nightmare for infrastructure. A subnet router turns a single node into a gateway. It acts as a bridge between the encrypted WireGuard mesh and your local physical network.

The magic here is that the devices on your LAN don't even know Tailscale exists. They just see traffic coming from the subnet router's local IP. You get the security of a private mesh without having to touch the network configuration of your legacy gear or your Kubernetes pods.

Implementation: The "Happy Path" and the Reality

The official docs make this look like a one-line command. While that's technically true, if you're running this on a production-grade homelab or a bare-metal node, there are a few kernel-level requirements that usually get glossed over.

First, you have to enable IP forwarding. If the Linux kernel isn't allowed to pass packets between interfaces, your subnet router is just a fancy wall.

# Enable IPv4 forwarding immediately
sudo sysctl -w net.ipv4.ip_forward=1
sudo sysctl -w net.ipv6.conf.all.forwarding=1

# Make it persist across reboots
echo "net.ipv4.ip_forward = 1" | sudo tee -a /etc/sysctl.d/99-tailscale.conf
echo "net.ipv6.conf.all.forwarding = 1" | sudo tee -a /etc/sysctl.d/99-tailscale.conf
sudo sysctl -p /etc/sysctl.d/99-tailscale.conf

Once the kernel is ready, you bring Tailscale up. I've found that explicitly forcing the kernel TUN interface is safer than relying on the default, especially if you've experimented with userspace networking in the past. Userspace networking (--tun=userspace-networking) is a death sentence for subnet routing; it simply won't work because the OS doesn't see the interface as a routable device.

# Start Tailscale as a subnet router for a specific range
# Replace 10.0.0.0/24 with your actual local subnet
sudo tailscale up --tun=kernel --advertise-routes=10.0.0.0/24

After running this, you still aren't connected. You have to go into the Tailscale Admin Console and manually approve the routes. This is a security feature to prevent a compromised node from suddenly hijacking all traffic for your entire network.

The Kubernetes and Gateway Trap

If you're running your subnet router inside a container or as part of a larger orchestration layer, you'll likely run into the gateway.bind issue. I hit this while integrating a gateway with some Kubernetes services.

When using tools like OpenClaw or custom wrappers, the default binding often fails because the application tries to bind to an interface that isn't actually the LAN. If your config looks like a generic default, you'll see the node is "online" in the dashboard, but you can't ping anything on the local subnet.

You need to explicitly tell the gateway to bind to the LAN interface. In the JSON config, it looks like this:

{
  "gateway": {
    "bind": "lan"
  }
}

Without this, the traffic often loops back or hits a dead end in the container network. This is similar to the networking headaches I've dealt with regarding DNS resolution, like the Wildcard DNS and ndots:5 nightmare, where the system thinks it knows where to go but the underlying routing logic is fundamentally flawed.

Turning it into an Exit Node

A subnet router lets you reach your home from the outside. An exit node lets you send all your internet traffic through your home from the outside. It's the difference between "I want to see my Proxmox UI" and "I'm on public WiFi and I want to pretend I'm at home for security."

To do this, add the --advertise-exit-node flag:

sudo tailscale up --advertise-routes=10.0.0.0/24 --advertise-exit-node

Here is where most people get stuck. You'll enable the exit node, select it in your client, and then realize you have zero internet access. The packets are reaching your router, but they aren't being NAT'd back out to the web. You need a MASQUERADE rule in your iptables to handle the translation.

# Replace eth0 with your actual primary network interface
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

If you're using a modern distro with nftables, you'll need the equivalent rule there. If you don't do this, the return packets from the internet don't know how to get back to the Tailscale client because they're coming from a virtual IP range the rest of your network doesn't recognize.

The Advanced Headache: NAT-PMP and Network Namespaces

In some complex environments, specifically when dealing with certain industrial gateways or strict NAT setups, you'll encounter NAT-PMP misrouting. This happens when Tailscale tries to be too smart about the local network and accidentally routes requests to a remote subnet router instead of the local one.

The fix is ugly, but it works: isolate the Tailscale traffic into its own network namespace (netns). This prevents the daemon from interfering with the host's primary routing table in ways that cause loops.

# Create a dedicated namespace for Tailscale
ip netns add tailscale
ip link add veth0 type veth peer name veth1
ip link set veth0 netns tailscale

# Assign an IP to the virtual interface
ip addr add 10.0.0.2/24 dev veth1
ip link set veth1 up

# Run the daemon inside the namespace
ip netns exec tailscale tailscaled

This is overkill for 90% of homelabbers, but if you're building out automated infrastructure with OpenTofu and deploying these routers across multiple sites, you'll want this kind of isolation to ensure stability.

Comparison: Subnet Routers vs. Traditional VPNs

I've run OpenVPN and WireGuard (manual) for years. Here is the honest breakdown of why I switched to Tailscale subnet routers for my remote access.

Feature	Traditional VPN (OpenVPN/WireGuard)	Tailscale Subnet Router
Setup	Manual certs, port forwarding, firewall rules	Zero-config NAT traversal, OAuth
Client Mgmt	Distributing `.ovpn` or `.conf` files	Log in with SSO/Identity provider
Routing	Manual static routes on clients	Centralized route management in console
Maintenance	Updating keys, managing IP pools	Automatic key rotation, managed IPs
Performance	High (if tuned correctly)	High (WireGuard based)

The tradeoff is the "phone home" aspect. Tailscale's coordination server knows which nodes are online. For most of us, that's a fair price to pay to avoid spending a Saturday morning debugging why a UDP port isn't opening on a residential ISP.

Gotchas and Lessons Learned

If you're setting this up, watch out for these three things:

The "Double-Hop" Latency: If you use a subnet router and then an exit node on a different machine, your traffic is bouncing across your network multiple times. It's fine for SSH, but terrible for VoIP or gaming. Keep your subnet router and exit node on the same high-performance machine if possible.
DNS Leaks: Just because you can route to 10.0.0.x doesn't mean your DNS is working. You'll still be typing IPs unless you configure "MagicDNS" or set up a global nameserver in the Tailscale admin panel that points to your internal DNS (like AdGuard Home).
Fail-Closed Policies: By default, if your subnet router goes down, you lose access to the entire LAN. If this is for a production environment, I highly recommend setting up two subnet routers in different failure domains. Tailscale doesn't do "automatic failover" in the traditional sense, but you can have multiple nodes advertising the same route.

Final Thoughts

The subnet router is the only sane way to manage remote access to a complex lab. It separates the "connectivity" layer from the "device" layer. You don't need to care if your old NAS doesn't support WireGuard or if your industrial PLC has a proprietary OS. You just need one stable Linux box with ip_forwarding enabled and a couple of iptables rules.

Reach for this technique the moment you find yourself saying, "I wish I could just SSH into this thing without having to install a client on it." If you're looking to scale this into a larger professional setup, feel free to check out my infrastructure consulting services for help with AI agent orchestration or bare-metal networking.

PCIe Device Passthrough: NIC Name Instability and MAC Pinning

Guatu — Fri, 08 May 2026 04:15:19 +0000

My Proxmox node rebooted, and suddenly the host was unreachable via SSH. I had to plug in a physical monitor and keyboard only to find that my primary network interface, which had been enp4s0 for months, had decided to rename itself to enp5s0.

Because my /etc/network/interfaces file was explicitly tied to enp4s0, the bridge didn't come up, the IP wasn't assigned, and I was locked out of my own hardware.

What I expected

I expected the Linux kernel to consistently enumerate my PCIe devices. In a static hardware environment where nothing has moved, the PCI bus address should be deterministic. If the NIC is plugged into the same slot and the BIOS hasn't changed, enp4s0 should stay enp4s0 forever. This is the "happy path" most documentation assumes.

What actually happened

The reality is that PCIe enumeration is not always a constant. I'm using a mix of onboard NICs and a PCIe expansion card. I also have a GPU passed through to a VM.

The surprise here is how the kernel's predictable network interface naming (systemd-udevd) interacts with the PCIe topology. When I added a new PCIe device and tweaked some BIOS settings for IOMMU, the way the kernel mapped the physical slots to the virtual naming changed. A slight shift in how the PCIe switch reported the devices caused the index to jump.

This isn't just a "one-time fluke." If you're running a multi-node cluster or using GPUs that might move addresses (something I've documented before in GPU PCI Address Instability), you'll find that the kernel is surprisingly flexible with where it puts things.

The root cause is that enp4s0 is a name derived from the PCI location. If the location changes—even by one digit—the name changes. If your network config depends on that name, your system is one reboot away from a blackout.

The Fix: MAC Pinning

The only way to stop this is to stop relying on the PCI slot location and start relying on the hardware's unique identifier: the MAC address.

I decided to use systemd .link files. This allows me to tell the kernel: "I don't care where this device is on the PCIe bus; if it has this MAC address, call it eth0."

1. Identify the MAC address

First, I had to find the actual MAC of the problematic NIC while I had console access.

ip link show

I looked for the interface that was currently named enp5s0 (the "wrong" name) and copied the link/ether value.

2. Create the .link file

I created a custom link file in /etc/systemd/network/. I chose the name 10-lan.link to ensure it loads early in the boot process.

# /etc/systemd/network/10-lan.link
[Match]
MACAddress=00:11:22:33:44:55

[Link]
Name=eth0

(Note: I've anonymized the MAC address above. Use your actual hardware MAC here.)

3. Update the network configuration

Once the interface is pinned to eth0, I had to update the Proxmox network configuration to match. I edited /etc/network/interfaces to replace the volatile enp4s0 with the stable eth0.

# Example snippet from /etc/network/interfaces
auto eth0
iface eth0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.x/24
    gateway 10.0.0.1
    bridge-ports eth0
    bridge-stp off
    bridge-fd 0

4. Apply and verify

I ran systemd-networkd-restart (or just rebooted, since I was already at the console) and verified the name with ip a. The NIC was now consistently eth0, regardless of whether the PCIe bus shifted.

Why this matters

If you're just running a single VM on a desktop, this is a minor annoyance. But if you're building a production-grade homelab, this is a critical failure point.

You'll hit this specifically in these scenarios:

Adding/Removing PCIe Hardware: Adding a new NVMe drive or a GPU can shift the enumeration of other devices on the same root complex.
BIOS Updates: A BIOS update often resets PCIe lane bifurcation or IOMMU settings, which can completely reorder how the kernel sees your NICs.
Using PCIe Switches: Some high-end motherboards or riser cables use PCIe switches that can report different topologies depending on the power state of the devices.

The Tradeoff

The tradeoff here is that you're moving away from the "modern" predictable naming convention back to the "old" ethX style. Some people find eth0 ugly or outdated, but in a headless server environment, "ugly" is better than "unreachable."

I've also seen people try to fix this using udev rules in /etc/udev/rules.d/. While that works, .link files are the native systemd way to handle this and are generally cleaner to maintain.

Lessons Learned

The biggest lesson here is that documentation for Proxmox and Debian assumes your hardware topology is a constant. It isn't.

When you're doing complex things like PCIe passthrough—which I've detailed in my GPU Passthrough Gotcha Guide—you are intentionally messing with the PCI bus. You're telling the host kernel to ignore certain devices so the VM can claim them. This volatility is a side effect of that power.

If you are passing through NICs or GPUs, do not trust the default interface names. Pin your critical management interfaces to their MAC addresses immediately. It takes five minutes to set up and saves you from a midnight trip to the server rack because a reboot decided your network card now lives at enp6s0.

For those of you managing larger fleets or complex AI agent infrastructure, this kind of hardware-level stability is the foundation. You can't build a reliable multi-agent AI pipeline if the underlying Kubernetes worker nodes are randomly losing their network identity.

Next time you're configuring a new node, don't just copy the enpXsX name from the GUI. Take the extra step to pin it. Your future self will thank you when the next BIOS update doesn't break your entire cluster.

GPU PCI Address Instability: When Your Card Moves Between Reboots

Guatu — Thu, 07 May 2026 00:15:04 +0000

I spent an entire afternoon debugging a VM that refused to boot, only to find out my GPU had decided to change its PCI address. One reboot and the device that lived at 01:00.0 suddenly migrated to 02:00.0. Because my Proxmox VM configuration was pinned to the old address, the VM crashed with a QEMU assertion error, and the GPU simply vanished from the guest.

This usually happens because of how the BIOS handles PCIe enumeration during POST. If you have multiple PCIe devices or a complex motherboard topology, the bus numbering isn't always deterministic. This is compounded by AMD Ryzen C-states or weird UMA frame buffer settings that can delay device initialization, causing the kernel to assign addresses in a different order than the previous boot. If you've already dealt with AMD iGPU RAM theft, you know how sensitive these BIOS settings are.

If you're on Proxmox 8.4+, the "happy path" is to use the q35 machine type. The older i440fx is more prone to these PCI mapping failures and IRQ conflicts. I also found that preventing the card from entering deep power states helps avoid the "zombie GPU" scenario where the card is physically there but logically dead.

To stabilize this, I switched the VM to q35 and explicitly enabled PCIe mode for the passthrough device. I also added a kernel parameter to stop the CPU from entering deep sleep states, which I've found reduces the randomness of the PCIe bus scan.

# 1. Change VM to q35 machine type for better PCIe support
qm set <VMID> --machine q35

# 2. Pass through the GPU with pcie=1 to ensure it's treated as a PCIe device
# Replace <PCI_ADDRESS> with your current address (e.g., 0000:01:00.0)
qm set <VMID> -hostpci0 <PCI_ADDRESS>,pcie=1

# 3. To stop the GPU from entering D3cold (which can cause boot-time instability)
# Run this on the Proxmox host
echo 0 > /sys/bus/pci/devices/0000:<PCI_BUS>:<PCI_SLOT>.0/d3cold_allowed

If the addresses keep shifting despite these changes, you're fighting your motherboard's firmware. At that point, I stopped fighting the VM abstraction and moved the NVIDIA drivers directly onto the Proxmox host. I then used the NVIDIA Container Toolkit to expose the GPU to my Kubernetes worker. It removes the PCI address fragility entirely because the host driver handles the hardware mapping, and the containers just see the device.

The lesson here is that PCI addresses are not constants; they are suggestions. If your workload requires 100% uptime and you can't guarantee a static PCI map, stop using VM passthrough and move the driver to the host.

Cognitive Memory for Agents: Vector Search vs Activation-Based Recall

Guatu — Wed, 06 May 2026 22:15:04 +0000

I spent a few weeks trying to build an agent that could remember specific user preferences across sessions without bloating the context window to a point where latency became unbearable. The standard advice is always "just use a vector database." But as the memory store grew, I noticed a weird gap: the agent could find a document about "user prefers dark mode" via cosine similarity, but it couldn't "recall" the immediate emotional state or the nuance of the last three turns of conversation unless they were explicitly mirrored in the embedding.

The problem is that vector search is a retrieval mechanism, not a cognitive memory system. When you move from simple RAG to actual agentic memory, you have to choose between external vector search and internal activation-based recall.

The Decision Point

You face this choice when your agent's "short-term" memory (the context window) is full, and your "long-term" memory (the database) is returning results that are mathematically similar but contextually irrelevant.

If you need your agent to remember a 500-page technical manual, you need a vector store. If you need your agent to exhibit a consistent "personality" or recall a specific pattern of behavior that isn't easily summarized into a string of text for an embedding model, you need something closer to activation-based recall.

Option A: Vector Search (The External Archive)

Vector search is the industry standard for a reason: it's easy to scale and the tooling is mature. You turn a piece of text into a vector using an embedding model (like text-embedding-3-small), shove it into a store like FAISS or Milvus, and query it with another vector.

Strengths:

Scale: You can store billions of vectors.
Cold Storage: It doesn't eat VRAM. It lives on disk or in a dedicated database.
Interpretability: I can literally query the database and see exactly which chunk of text was retrieved.

Weaknesses:

The "Semantic Gap": Cosine similarity is a blunt instrument. If a user says "That's not what I meant," a vector search might retrieve a passage about "meaning" or "intent" rather than understanding the correction.
Latency: You have to embed the query, hit the DB, and then stuff the results into the prompt.

Here is a basic implementation using FAISS. I use this for the "knowledge base" layer of my agents:

import faiss
import numpy as np

# Dimension depends on your embedding model (e.g., 1536 for OpenAI)
dimension = 128 
nb = 1000  # number of memory chunks
index = faiss.IndexFlatL2(dimension) 

# Mocking embeddings of agent experiences
vectors = np.random.random((nb, dimension)).astype('float32')
index.add(vectors) 

# Querying for the top 4 most similar memories
queries = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(queries, 4) 
print(f"Retrieved memory indices: {indices}")

Option B: Activation-Based Recall (The Internal Intuition)

Activation-based recall is more akin to how biological memory works. Instead of searching a database, the "memory" is stored in the weights or the hidden states of the model. In modern agent architectures, this often involves using activation hooks or specialized memory layers (like Memory Transformers) that allow the model to trigger a recall based on the current internal state of the network.

Strengths:

Speed: There is no external API call or DB lookup. The recall happens during the forward pass.
Nuance: It captures "how" something was said, not just "what" was said. It's an associative trigger rather than a keyword search.

Weaknesses:

The Black Box: Debugging this is a nightmare. You can't just "look" at the database to see why the agent recalled a specific memory.
VRAM Pressure: Storing these activations or maintaining a dynamic memory network consumes precious GPU memory.

I've experimented with simple activation hooks in PyTorch to track which "states" trigger certain behaviors. It's not a full-blown Memory Transformer, but it's a start:

import torch
from torch import nn

class AgentModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.memory_buffer = []

    def forward(self, x):
        # In a real system, this would be a specific layer's activation
        # that represents a 'concept' or 'state'
        activation = torch.tanh(x) 

        # Store the activation state for later recall/analysis
        self.memory_buffer.append(activation.detach().cpu().numpy())
        return activation

model = AgentModel()
input_tensor = torch.rand(1, 128)
output = model(input_tensor)
print(f"Stored state vector: {model.memory_buffer[-1]}")

Decision Framework

Criteria	Vector Search	Activation-Based Recall
Data Volume	Massive (TB+)	Small (MB to GB)
Retrieval Speed	Milliseconds (Network/Disk)	Microseconds (GPU)
Precision	Semantic/Keyword	Associative/Pattern
Debugging	Easy (Query the DB)	Hard (Analyze Tensors)
Resource Cost	CPU/Disk/API	VRAM/Compute

My Pick and Why

I don't pick one. I use a hybrid.

If you're building a production agent, relying solely on vector search leads to that "robotic" feeling where the agent repeats the same retrieved snippet regardless of the conversation flow. Relying solely on activations is a recipe for a system you can't debug when it starts hallucinating.

I implement a tiered system. I use a vector store for the "Library" (hard facts, documentation) and a sliding window of activations for the "Working Memory" (current mood, immediate goals, recent corrections). This mirrors the 6-layer memory architecture I've used for my own tools.

For those building multi-agent systems, I recommend offloading the vector search to a shared service and keeping the activation-based recall local to the agent's specific instance. This prevents the "shared memory" from becoming a noisy mess of conflicting embeddings. You can see how this fits into larger patterns in my post on multi-agent architecture patterns.

If you're still struggling with agents that forget things every five minutes, you might be hitting a safety loop. I've written about three-layer safety for autonomous agents which often solves the "infinite loop" problem that people mistake for a memory issue.

If you need help designing a memory architecture that doesn't melt your GPU or your budget, check out my AI agent consulting services.

Lessons learned:
The docs for vector DBs make it sound like they replace the need for cognitive memory. They don't. They replace the need for a filing cabinet. If you want an agent that actually "feels" like it's learning from a conversation in real-time, you have to move closer to the activations.

Vibration Monitoring Architecture: From Sensor to Dashboard

Guatu — Wed, 06 May 2026 16:15:04 +0000

The first time I tried to stream raw vibration data to a dashboard, I managed to crash my MQTT broker in under ten minutes. I had a high-frequency accelerometer spitting out samples at 5kHz, and I thought I'd just wrap those values in JSON and send them over the wire. The result wasn't a pretty graph; it was a series of Connection refused errors and a broker that had completely locked up under the weight of thousands of tiny packets per second.

If you're building a vibration monitoring system, you're not just dealing with "IoT data." You're dealing with signal processing. There is a massive difference between reporting a temperature every 30 seconds and capturing the harmonic frequencies of a motor bearing. If you treat vibration data like any other telemetry, your network will choke, your database will bloat, and your dashboards will be useless.

What I tried first (The wrong way)

My initial assumption was that the "modern stack" (Sensor $\rightarrow$ MQTT $\rightarrow$ Time Series DB $\rightarrow$ Grafana) would handle everything. I used a cheap industrial sensor that output raw voltage via a 4-20mA loop, fed into a PLC, which then pushed data to a Python script on a Raspberry Pi.

I wrote a simple loop that read the sensor and published to a topic:

# DO NOT DO THIS
while True:
    val = sensor.read() 
    client.publish("factory/machine1/vibration", json.dumps({"value": val}))

I quickly hit three walls:

Network Saturation: Sending one MQTT packet per sample is an architectural sin. The overhead of the TCP/IP stack and MQTT headers is larger than the actual payload. I was spending 90% of my bandwidth on headers.
Database Explosion: InfluxDB is great, but inserting 5,000 points per second per sensor is a recipe for a disk space crisis. My cardinality exploded, and queries that should have taken milliseconds started taking 30 seconds.
The "Noise" Problem: The raw data was a jagged mess. I couldn't see the actual vibration patterns because the high-frequency electrical noise from the nearby VFDs (Variable Frequency Drives) was masking the mechanical signal.

I realized that the gap between the sensor and the dashboard isn't a straight line. It's a funnel. You have to aggressively reduce the data volume at the edge before it ever touches the network.

The Actual Solution: The Edge-Heavy Pipeline

To make this work, I shifted the intelligence to the edge. The goal is to move from "streaming raw samples" to "streaming features." Instead of sending every single point, I calculate the RMS (Root Mean Square), Peak-to-Peak, and FFT (Fast Fourier Transform) bins locally.

1. Signal Conditioning and Edge Processing

I moved the processing to a dedicated edge gateway. I used a Python-based service that buffers samples in memory, applies a digital filter to remove electrical noise, and calculates the metrics.

Here is the implementation of the signal conditioning and feature extraction:

import numpy as np
from scipy.signal import butter, filtfilt
import paho.mqtt.client as mqtt
import time

# Configuration for a 10kHz sampling rate
FS = 10000 
CUTOFF = 2000 # Remove noise above 2kHz
ORDER = 4

def butter_lowpass_filter(data, cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return filtfilt(b, a, data)

def calculate_features(buffer):
    # Filter the raw signal to remove high-frequency noise
    filtered = butter_lowpass_filter(buffer, CUTOFF, FS, ORDER)

    # Calculate RMS - the primary indicator of overall vibration level
    rms = np.sqrt(np.mean(filtered**2))

    # Calculate Peak-to-Peak
    ptp = np.ptp(filtered)

    # Perform FFT to find the dominant frequency
    fft_vals = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(len(filtered), 1/FS)
    dominant_freq = freqs[np.argmax(fft_vals)]

    return {
        "rms": float(rms),
        "ptp": float(ptp),
        "dom_freq": float(dominant_freq)
    }

# Main loop: Buffer 1000 samples, then send 1 summary packet
client = mqtt.Client()
client.connect("mqtt-broker.example.com", 1883)

buffer = []
while True:
    val = read_sensor_raw() # Mock function for ADC read
    buffer.append(val)

    if len(buffer) >= 1000:
        features = calculate_features(buffer)
        # Send summary instead of 1000 raw points
        client.publish("iiot/machine1/vibration/features", str(features))
        buffer = [] # Clear buffer

2. The Transport Layer (MQTT 5.0)

For the broker, I shifted from a basic Mosquitto setup to a more controlled configuration. Since vibration data is critical for predictive maintenance, I needed to ensure that the "heartbeat" of the machine was always known.

I used MQTT 5.0 "Will Messages" to detect if a gateway went offline. If the gateway crashes, the broker immediately publishes a "disconnected" status to the health topic, so the dashboard doesn't just show a flat line (which could be mistaken for a stopped machine).

# mosquitto.conf snippet
listener 1883
allow_anonymous false
password_file /etc/mosquitto/passwd
# Prevent the broker from being overwhelmed by slow consumers
max_queued_messages 1000

I've written more about choosing the right broker in my MQTT Broker Selection post, but for vibration, the priority is low latency and high reliability over massive scale.

3. Storage and Visualization

I used InfluxDB 2.x for storage because of its native handling of time-series data. Instead of storing the raw waveform, I store the calculated features. This reduces the storage requirement by 1000x.

In Grafana, I set up a dashboard that monitors the RMS value. However, looking at a raw line graph of vibration is usually useless for operators. They don't know if 0.5g is "bad" or "normal."

I integrated this with a health scoring system. I used a Flux query in InfluxDB to compare the current RMS against a baseline (the average of the last 7 days).

// InfluxDB Flux Query for Relative Vibration
from(bucket: "iiot_data")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "vibration_sensor")
  |> filter(fn: (r) => r["_field"] == "rms")
  |> aggregateWindow(every: 1m, fn: mean)
  |> map(fn: (r) => ({ r with value: r._value / 0.15 })) // Normalize against threshold 0.15g

This feeds directly into the concept of Equipment Health Scoring, where the goal is to give the operator a single "Health %" rather than a complex spectrum analysis.

Why this architecture works

The reason this works is that it respects the laws of physics and networking.

The Nyquist-Shannon Theorem tells us we need to sample at twice the frequency of the signal we want to capture. If you want to detect a bearing fault at 2kHz, you must sample at 4kHz+. Trying to do this over WiFi or Ethernet using standard JSON-over-MQTT is impossible because the packet overhead kills the throughput.

By calculating the RMS and FFT at the edge, we are performing Data Reduction. We transform a high-bandwidth signal (time domain) into a low-bandwidth set of descriptors (frequency domain).

The edge processing also acts as a mechanical filter. By using a Butterworth low-pass filter, I can strip out the 60Hz hum from the power lines and the high-frequency spikes from the VFDs. If you do this in the cloud, you've already wasted the bandwidth sending noise.

Lessons learned and caveats

If I had to build this again, I'd change a few things:

1. Hardware-level filtering: I spent too much time in Python trying to fix signal noise. In a real industrial environment, you should use an analog anti-aliasing filter (a physical capacitor/resistor circuit) before the signal ever hits the ADC. Software filters are great, but they can't fix aliasing if the signal was already corrupted during sampling.

2. The "Buffer" Trap: My Python script used a simple list for the buffer. At very high sampling rates, Python's list appending becomes slow. I had to switch to numpy arrays with pre-allocated memory to avoid garbage collection pauses that caused gaps in the data.

3. Provisioning the Edge: Managing these Python scripts across five different gateways was a nightmare. I eventually moved the deployment to a GitOps flow, using OpenTofu and GitHub Actions to manage the underlying VM configurations on my Proxmox cluster, ensuring every gateway had the exact same version of scipy and numpy.

4. The Dashboard Paradox: The more data I put on the dashboard, the less the operators used it. The final version of the system only shows three things: a Green/Yellow/Red light for health, the current RMS value, and a "Time to Maintenance" estimate. Everything else (the FFT bins, the raw waveforms) is hidden in a "Deep Dive" tab that only the reliability engineer ever opens.

Vibration monitoring is a classic example of where "more data" is actually "less information." The value isn't in the sensor; it's in the reduction process that happens between the sensor and the screen.

Unprivileged LXC + Docker: The runc Sysctl Permission Trap

Guatu — Tue, 05 May 2026 00:15:20 +0000

sysctl: setting key "net.ipv4.ip_local_port_range": Permission denied

I saw this error while trying to tune the network stack for a high-concurrency service running in Docker, which itself was hosted inside an unprivileged LXC container on Proxmox. The weird part? I was root inside the container.

I expected that since I had already enabled nesting=1 and keyctl=1 in the LXC configuration, Docker would have the necessary permissions to modify kernel parameters via runc. In a standard VM, this is trivial. In a privileged container, it just works. But in an unprivileged container, the user namespace mapping creates a wall that runc cannot climb.

What actually happened is a collision between systemd (v243+), runc, and the Linux kernel's security model for unprivileged user namespaces. When you run an unprivileged LXC, the root user inside the container is actually a non-privileged user on the Proxmox host (usually UID 100000).

The kernel prevents these mapped users from modifying sysctl settings because those settings are often global or namespace-specific in ways that could allow a container to crash the host or leak information. runc, the runtime Docker uses, tries to apply these settings during container creation, but the kernel returns a permission denied error. Because of how some Docker versions handle this, the error is sometimes swallowed, and your app just runs with the wrong defaults.

If you're building a production-grade homelab, you probably don't want to just switch to a privileged container. That's a security nightmare. Instead, you have to move the configuration "up" the chain.

The fix is to apply the sysctl settings at the LXC level before the container fully initializes, or directly on the host if the parameter isn't namespaced. Since we want to keep the host clean, using an LXC pre-start hook is the cleanest way to inject these settings.

On the Proxmox host, you can add a hook to the container's configuration file (usually in /etc/pve/lxc/ID.conf).

# Add this to your LXC .conf file on the Proxmox host
lxc.hook.pre-start = /usr/bin/echo "net.ipv4.ip_local_port_range = 1024 65535" >> /etc/sysctl.d/99-lxc.conf

However, for most users, the most reliable method is to define the parameter in the host's sysctl.conf if it's a global setting, or use the lxc.sysctl directive in the config file:

# Example Proxmox LXC config snippet
arch: amd64
cores: 2
memory: 2048
net0: name=eth0,bridge=vmbr0,ip=10.0.0.x/24,gw=10.0.0.1
ostype: ubuntu
unprivileged: 1
features: nesting=1,keyctl=1
# Inject the sysctl here
lxc.sysctl.net.ipv4.ip_local_port_range = 1024 65535

After adding this, you have to restart the container. If you just restart the Docker daemon inside the LXC, the kernel parameter won't update because the LXC boundary is where the restriction lives.

This trap is common when you're trying to optimize networking or memory management (like vm.max_map_count for Elasticsearch) inside a nested environment. If you've dealt with the headache of GPU passthrough on Proxmox, you know that the gap between "it's a container" and "it's an unprivileged container" is where most of the pain lives.

One last thing to watch out for: UID shifts. If you're mounting NFS shares into these containers to provide storage for your Docker volumes, you'll hit the UID mismatch. The container thinks it's root (UID 0), but the host sees UID 100000. I've spent hours debugging "Permission Denied" on volumes only to realize I needed to chmod 0777 the host directory or properly map the IDs in the .conf file.

If you're scaling this into a larger cluster, I highly recommend moving these workloads to bare-metal Kubernetes. I wrote about my experience with Longhorn for bare-metal storage, and while the initial setup is heavier than an LXC, you stop fighting the Proxmox container permission war and start dealing with standard K8s primitives.

DEV Community: Guatu

Tesla P40 in a Homelab: 24GB of Inference on a Budget

The Passthrough Trap

The Solution: Host-Level Inference

1. Cleaning the Slate

2. Host Driver Installation

3. Deploying Ollama as a Systemd Service

The VRAM Reality Check

Monitoring the Blind Spot

Why This Actually Works

Lessons Learned

Longhorn Volume Health: The Gap Between 'Healthy' and Actually Working

The Illusion of Health

Solving the Stale Mount Trap

The Capacity Lie: Snapshot Bloat

Permissions and the SecurityContext Gap

Monitoring That Actually Matters

Gotchas and Tradeoffs

Lessons Learned

Condition-Based vs Time-Based Maintenance: Making the Switch

What I tried first

The actual solution

1. Filtering the Noise

2. Condition-Based Escalation Rules

3. Optimizing the Alerting Pipeline

Why it works

Lessons learned

Proxmox Cluster Quorum: How Many Nodes Do You Actually Need

What I tried first

The actual solution

Option 1: The Three-Node Standard

Option 2: The QDevice (The "Cheap" Vote)

Option 3: Monitoring and API Integration

Why it works

Lessons learned

Kyverno Admission Controllers: Policy-as-Code That Actually Works

Why you'd choose a Policy Engine

Option A: OPA Gatekeeper

Option B: Kyverno

Decision Framework

My Pick and Why

The "Infrastructure Exclusion" Pattern

Handling Security Contexts without Breaking the Cluster

The Danger of synchronize: true

Orphaned Resources and the Cleanup Gap

Integration with the Wider Stack

Lessons Learned

Privacy-Routed LLM Inference: Keeping Sensitive Data Out of the Cloud

The Wrong Way: Trusting the Orchestrator

The Actual Solution: Two-Tier Privacy Routing

1. The Reference-Only Knowledge Base

2. The Privacy Gate (Routing Layer)

3. Hardening the Execution

Why This Works

Lessons Learned

Tailscale Subnet Routers: Accessing Your LAN Without the VPN Headache

The Concept: Routing vs. Agent-based Access

Implementation: The "Happy Path" and the Reality

The Kubernetes and Gateway Trap

Turning it into an Exit Node

The Advanced Headache: NAT-PMP and Network Namespaces

Comparison: Subnet Routers vs. Traditional VPNs

Gotchas and Lessons Learned

Final Thoughts

PCIe Device Passthrough: NIC Name Instability and MAC Pinning

What I expected

What actually happened

The Fix: MAC Pinning

1. Identify the MAC address

2. Create the .link file

3. Update the network configuration

4. Apply and verify

Why this matters

The Tradeoff

Lessons Learned

GPU PCI Address Instability: When Your Card Moves Between Reboots

Cognitive Memory for Agents: Vector Search vs Activation-Based Recall

The Decision Point

Option A: Vector Search (The External Archive)

Option B: Activation-Based Recall (The Internal Intuition)

The Danger of `synchronize: true`