<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Guatu</title>
    <description>The latest articles on DEV Community by Guatu (@futhgar).</description>
    <link>https://dev.to/futhgar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847021%2F5aa46faa-d8e6-4023-ad78-5a335f875d69.png</url>
      <title>DEV Community: Guatu</title>
      <link>https://dev.to/futhgar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/futhgar"/>
    <language>en</language>
    <item>
      <title>PCIe Device Passthrough: NIC Name Instability and MAC Pinning</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Fri, 08 May 2026 04:15:19 +0000</pubDate>
      <link>https://dev.to/futhgar/pcie-device-passthrough-nic-name-instability-and-mac-pinning-4di7</link>
      <guid>https://dev.to/futhgar/pcie-device-passthrough-nic-name-instability-and-mac-pinning-4di7</guid>
      <description>&lt;p&gt;My Proxmox node rebooted, and suddenly the host was unreachable via SSH. I had to plug in a physical monitor and keyboard only to find that my primary network interface, which had been &lt;code&gt;enp4s0&lt;/code&gt; for months, had decided to rename itself to &lt;code&gt;enp5s0&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Because my &lt;code&gt;/etc/network/interfaces&lt;/code&gt; file was explicitly tied to &lt;code&gt;enp4s0&lt;/code&gt;, the bridge didn't come up, the IP wasn't assigned, and I was locked out of my own hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I expected
&lt;/h2&gt;

&lt;p&gt;I expected the Linux kernel to consistently enumerate my PCIe devices. In a static hardware environment where nothing has moved, the PCI bus address should be deterministic. If the NIC is plugged into the same slot and the BIOS hasn't changed, &lt;code&gt;enp4s0&lt;/code&gt; should stay &lt;code&gt;enp4s0&lt;/code&gt; forever. This is the "happy path" most documentation assumes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually happened
&lt;/h2&gt;

&lt;p&gt;The reality is that PCIe enumeration is not always a constant. I'm using a mix of onboard NICs and a PCIe expansion card. I also have a GPU passed through to a VM. &lt;/p&gt;

&lt;p&gt;The surprise here is how systemd-udevd's predictable network interface naming interacts with the PCIe topology. When I added a new PCIe device and tweaked BIOS settings for IOMMU, the bus numbers the kernel assigned to the physical slots shifted. Because predictable names are derived from those bus numbers, a slight change in how the devices were reported was enough to bump the index.&lt;/p&gt;

&lt;p&gt;This isn't just a "one-time fluke." If you're running a multi-node cluster or using GPUs that might move addresses (something I've documented before in &lt;a href="https://guatulabs.dev/posts/gpu-pci-address-instability-when-your-card-moves-between-reboots/" rel="noopener noreferrer"&gt;GPU PCI Address Instability&lt;/a&gt;), you'll find that the kernel is surprisingly flexible with where it puts things. &lt;/p&gt;

&lt;p&gt;The root cause is that &lt;code&gt;enp4s0&lt;/code&gt; is a name derived from the PCI location. If the location changes—even by one digit—the name changes. If your network config depends on that name, your system is one reboot away from a blackout.&lt;/p&gt;
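
&lt;p&gt;To see why, note that the bus number is literally embedded in the name: &lt;code&gt;enp4s0&lt;/code&gt; means Ethernet, PCI bus 4, slot 0. A throwaway decoder (my own illustration, not a real tool) makes the convention visible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustrative only: predictable names encode the PCI location
decode() { echo "$1" | sed -E 's/^enp([0-9]+)s([0-9]+)$/bus=\1 slot=\2/'; }
decode enp4s0   # bus=4 slot=0
decode enp5s0   # bus=5 slot=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;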

&lt;h2&gt;
  
  
  The Fix: MAC Pinning
&lt;/h2&gt;

&lt;p&gt;The only way to stop this is to stop relying on the PCI slot location and start relying on the hardware's unique identifier: the MAC address. &lt;/p&gt;

&lt;p&gt;I decided to use systemd &lt;code&gt;.link&lt;/code&gt; files. This allows me to tell the kernel: "I don't care where this device is on the PCIe bus; if it has this MAC address, call it &lt;code&gt;eth0&lt;/code&gt;."&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Identify the MAC address
&lt;/h3&gt;

&lt;p&gt;First, I had to find the actual MAC of the problematic NIC while I had console access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip &lt;span class="nb"&gt;link &lt;/span&gt;show
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I looked for the interface that was currently named &lt;code&gt;enp5s0&lt;/code&gt; (the "wrong" name) and copied the &lt;code&gt;link/ether&lt;/code&gt; value.&lt;/p&gt;
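
&lt;p&gt;The quickest way to grab just the MAC is &lt;code&gt;cat /sys/class/net/enp5s0/address&lt;/code&gt;. If you'd rather parse it out of the &lt;code&gt;ip&lt;/code&gt; output, the sketch below runs against a sample line (MAC anonymized) so the extraction is reproducible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# On the host: cat /sys/class/net/enp5s0/address
# Or parse the link/ether field from `ip -o link` output (sample line):
sample='2: enp5s0: BROADCAST,MULTICAST,UP mtu 1500 ... link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff'
printf '%s\n' "$sample" | grep -oE 'link/ether ([0-9a-f]{2}:){5}[0-9a-f]{2}' | awk '{print $2}'
# prints 00:11:22:33:44:55
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;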

&lt;h3&gt;
  
  
  2. Create the .link file
&lt;/h3&gt;

&lt;p&gt;I created a custom link file in &lt;code&gt;/etc/systemd/network/&lt;/code&gt;. I chose the name &lt;code&gt;10-lan.link&lt;/code&gt; because link files are applied in lexical order, and the low prefix ensures mine is matched before the stock &lt;code&gt;99-default.link&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/network/10-lan.link
&lt;/span&gt;&lt;span class="nn"&gt;[Match]&lt;/span&gt;
&lt;span class="py"&gt;MACAddress&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;00:11:22:33:44:55&lt;/span&gt;

&lt;span class="nn"&gt;[Link]&lt;/span&gt;
&lt;span class="py"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;eth0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Note: I've anonymized the MAC address above. Use your actual hardware MAC here.)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Update the network configuration
&lt;/h3&gt;

&lt;p&gt;Once the interface is pinned to &lt;code&gt;eth0&lt;/code&gt;, I had to update the Proxmox network configuration to match. I edited &lt;code&gt;/etc/network/interfaces&lt;/code&gt; to replace the volatile &lt;code&gt;enp4s0&lt;/code&gt; with the stable &lt;code&gt;eth0&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example snippet from /etc/network/interfaces&lt;/span&gt;
auto eth0
iface eth0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.x/24
    gateway 10.0.0.1
    bridge-ports eth0
    bridge-stp off
    bridge-fd 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Apply and verify
&lt;/h3&gt;

&lt;p&gt;I rebuilt the initramfs with &lt;code&gt;update-initramfs -u&lt;/code&gt; so the &lt;code&gt;.link&lt;/code&gt; file is honored during early boot, then rebooted (I was already at the console, and the kernel won't reliably rename an interface that is already up). After the reboot I checked the match with &lt;code&gt;udevadm test-builtin net_setup_link /sys/class/net/eth0&lt;/code&gt; and verified the name with &lt;code&gt;ip a&lt;/code&gt;. The NIC was now consistently &lt;code&gt;eth0&lt;/code&gt;, regardless of whether the PCIe bus shifted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;If you're just running a single VM on a desktop, this is a minor annoyance. But if you're building a &lt;a href="https://guatulabs.dev/posts/building-production-homelab/" rel="noopener noreferrer"&gt;production-grade homelab&lt;/a&gt;, this is a critical failure point.&lt;/p&gt;

&lt;p&gt;You'll hit this specifically in these scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adding/Removing PCIe Hardware:&lt;/strong&gt; Adding a new NVMe drive or a GPU can shift the enumeration of other devices on the same root complex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BIOS Updates:&lt;/strong&gt; A BIOS update often resets PCIe lane bifurcation or IOMMU settings, which can completely reorder how the kernel sees your NICs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using PCIe Switches:&lt;/strong&gt; Some high-end motherboards or riser cables use PCIe switches that can report different topologies depending on the power state of the devices.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Tradeoff
&lt;/h3&gt;

&lt;p&gt;The tradeoff here is that you're moving away from the "modern" predictable naming convention back to the "old" &lt;code&gt;ethX&lt;/code&gt; style. Some people find &lt;code&gt;eth0&lt;/code&gt; ugly or outdated, but in a headless server environment, "ugly" is better than "unreachable."&lt;/p&gt;

&lt;p&gt;I've also seen people try to fix this using udev rules in &lt;code&gt;/etc/udev/rules.d/&lt;/code&gt;. While that works, &lt;code&gt;.link&lt;/code&gt; files are the native systemd way to handle this and are generally cleaner to maintain.&lt;/p&gt;
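
&lt;p&gt;For comparison, a simplified udev rule doing the same pinning would look like this (same anonymized MAC; real-world rules usually add extra match keys such as &lt;code&gt;ATTR{type}=="1"&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/udev/rules.d/70-persistent-net.rules -- the older equivalent
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:11:22:33:44:55", NAME="eth0"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;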

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;The biggest lesson here is that documentation for Proxmox and Debian assumes your hardware topology is a constant. It isn't. &lt;/p&gt;

&lt;p&gt;When you're doing complex things like PCIe passthrough—which I've detailed in my &lt;a href="https://guatulabs.dev/posts/gpu-passthrough-on-proxmox-gotcha-guide/" rel="noopener noreferrer"&gt;GPU Passthrough Gotcha Guide&lt;/a&gt;—you are intentionally messing with the PCI bus. You're telling the host kernel to ignore certain devices so the VM can claim them. This volatility is a side effect of that power.&lt;/p&gt;

&lt;p&gt;If you are passing through NICs or GPUs, do not trust the default interface names. Pin your critical management interfaces to their MAC addresses immediately. It takes five minutes to set up and saves you from a midnight trip to the server rack because a reboot decided your network card now lives at &lt;code&gt;enp6s0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For those of you managing larger fleets or complex AI agent infrastructure, this kind of hardware-level stability is the foundation. You can't build a reliable &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;multi-agent AI pipeline&lt;/a&gt; if the underlying Kubernetes worker nodes are randomly losing their network identity.&lt;/p&gt;

&lt;p&gt;Next time you're configuring a new node, don't just copy the &lt;code&gt;enpXsX&lt;/code&gt; name from the GUI. Take the extra step to pin it. Your future self will thank you when the next BIOS update doesn't break your entire cluster.&lt;/p&gt;

</description>
      <category>proxmox</category>
      <category>pciepassthrough</category>
      <category>networking</category>
      <category>homelab</category>
    </item>
    <item>
      <title>GPU PCI Address Instability: When Your Card Moves Between Reboots</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Thu, 07 May 2026 00:15:04 +0000</pubDate>
      <link>https://dev.to/futhgar/gpu-pci-address-instability-when-your-card-moves-between-reboots-56mj</link>
      <guid>https://dev.to/futhgar/gpu-pci-address-instability-when-your-card-moves-between-reboots-56mj</guid>
      <description>&lt;p&gt;I spent an entire afternoon debugging a VM that refused to boot, only to find out my GPU had decided to change its PCI address. One reboot and the device that lived at &lt;code&gt;01:00.0&lt;/code&gt; suddenly migrated to &lt;code&gt;02:00.0&lt;/code&gt;. Because my Proxmox VM configuration was pinned to the old address, the VM crashed with a QEMU assertion error, and the GPU simply vanished from the guest.&lt;/p&gt;

&lt;p&gt;This usually happens because of how the BIOS handles PCIe enumeration during POST. If you have multiple PCIe devices or a complex motherboard topology, the bus numbering isn't always deterministic. This is compounded by AMD Ryzen C-states or weird UMA frame buffer settings that can delay device initialization, causing the kernel to assign addresses in a different order than the previous boot. If you've already dealt with &lt;a href="https://guatulabs.dev/posts/amd-igpu-stealing-your-ram-uma-frame-buffer-on-headless-servers/" rel="noopener noreferrer"&gt;AMD iGPU RAM theft&lt;/a&gt;, you know how sensitive these BIOS settings are.&lt;/p&gt;

&lt;p&gt;If you're on Proxmox 8.4+, the "happy path" is to use the &lt;code&gt;q35&lt;/code&gt; machine type. The older &lt;code&gt;i440fx&lt;/code&gt; is more prone to these PCI mapping failures and IRQ conflicts. I also found that preventing the card from entering deep power states helps avoid the "zombie GPU" scenario where the card is physically there but logically dead.&lt;/p&gt;

&lt;p&gt;To stabilize this, I switched the VM to &lt;code&gt;q35&lt;/code&gt; and explicitly enabled PCIe mode for the passthrough device. I also added a kernel parameter to stop the CPU from entering deep sleep states, which I've found reduces the randomness of the PCIe bus scan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Change VM to q35 machine type for better PCIe support&lt;/span&gt;
qm &lt;span class="nb"&gt;set&lt;/span&gt; &amp;lt;VMID&amp;gt; &lt;span class="nt"&gt;--machine&lt;/span&gt; q35

&lt;span class="c"&gt;# 2. Pass through the GPU with pcie=1 to ensure it's treated as a PCIe device&lt;/span&gt;
&lt;span class="c"&gt;# Replace &amp;lt;PCI_ADDRESS&amp;gt; with your current address (e.g., 0000:01:00.0)&lt;/span&gt;
qm &lt;span class="nb"&gt;set&lt;/span&gt; &amp;lt;VMID&amp;gt; &lt;span class="nt"&gt;-hostpci0&lt;/span&gt; &amp;lt;PCI_ADDRESS&amp;gt;,pcie&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# 3. To stop the GPU from entering D3cold (which can cause boot-time instability)&lt;/span&gt;
&lt;span class="c"&gt;# Run this on the Proxmox host&lt;/span&gt;
&lt;span class="nb"&gt;echo &lt;/span&gt;0 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/bus/pci/devices/0000:&amp;lt;PCI_BUS&amp;gt;:&amp;lt;PCI_SLOT&amp;gt;.0/d3cold_allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the addresses keep shifting despite these changes, you're fighting your motherboard's firmware. At that point, I stopped fighting the VM abstraction and moved the NVIDIA drivers directly onto the Proxmox host. I then used the &lt;a href="https://guatulabs.dev/posts/nvidia-container-toolkit-why-the-default-runtime-matters" rel="noopener noreferrer"&gt;NVIDIA Container Toolkit&lt;/a&gt; to expose the GPU to my Kubernetes worker. It removes the PCI address fragility entirely because the host driver handles the hardware mapping, and the containers just see the device.&lt;/p&gt;
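
&lt;p&gt;For what it's worth, once the host driver and the toolkit are in place, checking that containers can see the GPU is a one-liner. The image tag below is just an example; use whatever CUDA base image matches your host driver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Verify container GPU access (requires host driver + NVIDIA Container Toolkit)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;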

&lt;p&gt;The lesson here is that PCI addresses are not constants; they are suggestions. If your workload requires 100% uptime and you can't guarantee a static PCI map, stop using VM passthrough and move the driver to the host.&lt;/p&gt;

</description>
      <category>proxmox</category>
      <category>gpupassthrough</category>
      <category>pcie</category>
      <category>homelab</category>
    </item>
    <item>
      <title>Cognitive Memory for Agents: Vector Search vs Activation-Based Recall</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Wed, 06 May 2026 22:15:04 +0000</pubDate>
      <link>https://dev.to/futhgar/cognitive-memory-for-agents-vector-search-vs-activation-based-recall-52lh</link>
      <guid>https://dev.to/futhgar/cognitive-memory-for-agents-vector-search-vs-activation-based-recall-52lh</guid>
      <description>&lt;p&gt;I spent a few weeks trying to build an agent that could remember specific user preferences across sessions without bloating the context window to a point where latency became unbearable. The standard advice is always "just use a vector database." But as the memory store grew, I noticed a weird gap: the agent could find a document about "user prefers dark mode" via cosine similarity, but it couldn't "recall" the immediate emotional state or the nuance of the last three turns of conversation unless they were explicitly mirrored in the embedding.&lt;/p&gt;

&lt;p&gt;The problem is that vector search is a retrieval mechanism, not a cognitive memory system. When you move from simple RAG to actual agentic memory, you have to choose between external vector search and internal activation-based recall.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision Point
&lt;/h3&gt;

&lt;p&gt;You face this choice when your agent's "short-term" memory (the context window) is full, and your "long-term" memory (the database) is returning results that are mathematically similar but contextually irrelevant. &lt;/p&gt;

&lt;p&gt;If you need your agent to remember a 500-page technical manual, you need a vector store. If you need your agent to exhibit a consistent "personality" or recall a specific pattern of behavior that isn't easily summarized into a string of text for an embedding model, you need something closer to activation-based recall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: Vector Search (The External Archive)
&lt;/h3&gt;

&lt;p&gt;Vector search is the industry standard for a reason: it's easy to scale and the tooling is mature. You turn a piece of text into a vector using an embedding model (like &lt;code&gt;text-embedding-3-small&lt;/code&gt;), shove it into a store like FAISS or Milvus, and query it with another vector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale:&lt;/strong&gt; You can store billions of vectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold Storage:&lt;/strong&gt; It doesn't eat VRAM. It lives on disk or in a dedicated database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interpretability:&lt;/strong&gt; I can literally query the database and see exactly which chunk of text was retrieved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The "Semantic Gap":&lt;/strong&gt; Cosine similarity is a blunt instrument. If a user says "That's not what I meant," a vector search might retrieve a passage about "meaning" or "intent" rather than understanding the correction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; You have to embed the query, hit the DB, and then stuff the results into the prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a basic implementation using FAISS. I use this for the "knowledge base" layer of my agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Dimension depends on your embedding model (e.g., 1536 for OpenAI)
&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt; 
&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;  &lt;span class="c1"&gt;# number of memory chunks
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;# Mocking embeddings of agent experiences
&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;# Querying for the top 4 most similar memories
&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved memory indices: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option B: Activation-Based Recall (The Internal Intuition)
&lt;/h3&gt;

&lt;p&gt;Activation-based recall is more akin to how biological memory works. Instead of searching a database, the "memory" is stored in the weights or the hidden states of the model. In modern agent architectures, this often involves using activation hooks or specialized memory layers (like Memory Transformers) that allow the model to trigger a recall based on the current internal state of the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; There is no external API call or DB lookup. The recall happens during the forward pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nuance:&lt;/strong&gt; It captures "how" something was said, not just "what" was said. It's an associative trigger rather than a keyword search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Black Box:&lt;/strong&gt; Debugging this is a nightmare. You can't just "look" at the database to see why the agent recalled a specific memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM Pressure:&lt;/strong&gt; Storing these activations or maintaining a dynamic memory network consumes precious GPU memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've experimented with simple activation hooks in PyTorch to track which "states" trigger certain behaviors. It's not a full-blown Memory Transformer, but it's a start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# In a real system, this would be a specific layer's activation
&lt;/span&gt;        &lt;span class="c1"&gt;# that represents a 'concept' or 'state'
&lt;/span&gt;        &lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

        &lt;span class="c1"&gt;# Store the activation state for later recall/analysis
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detach&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;input_tensor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored state vector: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Decision Framework
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Vector Search&lt;/th&gt;
&lt;th&gt;Activation-Based Recall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive (TB+)&lt;/td&gt;
&lt;td&gt;Small (MB to GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds (Network/Disk)&lt;/td&gt;
&lt;td&gt;Microseconds (GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantic/Keyword&lt;/td&gt;
&lt;td&gt;Associative/Pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debugging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easy (Query the DB)&lt;/td&gt;
&lt;td&gt;Hard (Analyze Tensors)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CPU/Disk/API&lt;/td&gt;
&lt;td&gt;VRAM/Compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  My Pick and Why
&lt;/h3&gt;

&lt;p&gt;I don't pick one. I use a hybrid. &lt;/p&gt;

&lt;p&gt;If you're building a production agent, relying solely on vector search leads to that "robotic" feeling where the agent repeats the same retrieved snippet regardless of the conversation flow. Relying solely on activations is a recipe for a system you can't debug when it starts hallucinating.&lt;/p&gt;

&lt;p&gt;I implement a tiered system. I use a vector store for the "Library" (hard facts, documentation) and a sliding window of activations for the "Working Memory" (current mood, immediate goals, recent corrections). This mirrors the &lt;a href="https://dev.to/posts/six-layer-memory-architecture-for-claude-code"&gt;6-layer memory architecture&lt;/a&gt; I've used for my own tools.&lt;/p&gt;
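
&lt;p&gt;That tiering can be sketched in a few lines, with brute-force cosine similarity over numpy arrays standing in for the vector store and a bounded deque as the working-memory window. The class and method names here are my own, not from any framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from collections import deque

class TieredMemory:
    """Library tier = long-term facts; working tier = sliding activation window."""
    def __init__(self, window=8):
        self.library = []                    # list of (vector, text) facts
        self.working = deque(maxlen=window)  # recent state vectors, auto-evicted

    def remember_fact(self, vec, text):
        self.library.append((np.asarray(vec, dtype=np.float32), text))

    def observe(self, state_vec):
        self.working.append(np.asarray(state_vec, dtype=np.float32))

    def recall_fact(self, query, k=1):
        # cosine similarity against the long-term "Library" tier
        q = np.asarray(query, dtype=np.float32)
        scores = [
            (float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)), t)
            for v, t in self.library
        ]
        return [t for _, t in sorted(scores, reverse=True)[:k]]

mem = TieredMemory()
mem.remember_fact([1.0, 0.0, 0.0], "user prefers dark mode")
mem.remember_fact([0.0, 1.0, 0.0], "user timezone is UTC-6")
mem.observe([0.9, 0.1, 0.0])                 # working memory: recent state
print(mem.recall_fact([0.95, 0.05, 0.0]))    # ['user prefers dark mode']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;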

&lt;p&gt;For those building multi-agent systems, I recommend offloading the vector search to a shared service and keeping the activation-based recall local to the agent's specific instance. This prevents the "shared memory" from becoming a noisy mess of conflicting embeddings. You can see how this fits into larger patterns in my post on &lt;a href="https://dev.to/posts/multi-agent-ai-systems-architecture-patterns"&gt;multi-agent architecture patterns&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're still struggling with agents that forget things every five minutes, you might be hitting a safety loop. I've written about &lt;a href="https://dev.to/posts/three-layer-safety-autonomous-agents"&gt;three-layer safety for autonomous agents&lt;/a&gt; which often solves the "infinite loop" problem that people mistake for a memory issue.&lt;/p&gt;

&lt;p&gt;If you need help designing a memory architecture that doesn't melt your GPU or your budget, check out my &lt;a href="https://guatulabs.com/services" rel="noopener noreferrer"&gt;AI agent consulting services&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons learned:&lt;/strong&gt; &lt;br&gt;
The docs for vector DBs make it sound like they replace the need for cognitive memory. They don't. They replace the need for a filing cabinet. If you want an agent that actually "feels" like it's learning from a conversation in real-time, you have to move closer to the activations.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>vectordatabases</category>
      <category>llmmemory</category>
      <category>cognitivearchitecture</category>
    </item>
    <item>
      <title>Vibration Monitoring Architecture: From Sensor to Dashboard</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Wed, 06 May 2026 16:15:04 +0000</pubDate>
      <link>https://dev.to/futhgar/vibration-monitoring-architecture-from-sensor-to-dashboard-26ib</link>
      <guid>https://dev.to/futhgar/vibration-monitoring-architecture-from-sensor-to-dashboard-26ib</guid>
      <description>&lt;p&gt;The first time I tried to stream raw vibration data to a dashboard, I managed to crash my MQTT broker in under ten minutes. I had a high-frequency accelerometer spitting out samples at 5kHz, and I thought I'd just wrap those values in JSON and send them over the wire. The result wasn't a pretty graph; it was a series of &lt;code&gt;Connection refused&lt;/code&gt; errors and a broker that had completely locked up under the weight of thousands of tiny packets per second.&lt;/p&gt;

&lt;p&gt;If you're building a vibration monitoring system, you're not just dealing with "IoT data." You're dealing with signal processing. There is a massive difference between reporting a temperature every 30 seconds and capturing the harmonic frequencies of a motor bearing. If you treat vibration data like any other telemetry, your network will choke, your database will bloat, and your dashboards will be useless.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I tried first (The wrong way)
&lt;/h3&gt;

&lt;p&gt;My initial assumption was that the "modern stack" (Sensor → MQTT → Time Series DB → Grafana) would handle everything. I used a cheap industrial sensor that output an analog signal over a 4-20mA current loop, fed into a PLC, which then pushed data to a Python script on a Raspberry Pi.&lt;/p&gt;

&lt;p&gt;I wrote a simple loop that read the sensor and published to a topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# DO NOT DO THIS
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sensor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;factory/machine1/vibration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I quickly hit three walls:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Network Saturation:&lt;/strong&gt; Sending one MQTT packet per sample is an architectural sin. The overhead of the TCP/IP stack and MQTT headers is larger than the actual payload. I was spending 90% of my bandwidth on headers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Explosion:&lt;/strong&gt; InfluxDB is great, but inserting 5,000 points per second per sensor is a recipe for a disk space crisis. My cardinality exploded, and queries that should have taken milliseconds started taking 30 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Noise" Problem:&lt;/strong&gt; The raw data was a jagged mess. I couldn't see the actual vibration patterns because the high-frequency electrical noise from the nearby VFDs (Variable Frequency Drives) was masking the mechanical signal.&lt;/li&gt;
&lt;/ol&gt;
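&lt;p&gt;To put rough numbers on the overhead problem (byte counts below are ballpark assumptions for illustration, not measurements from my setup):&lt;/p&gt;

```python
# Back-of-envelope: per-sample publishing vs. batched feature packets.
# All byte counts are rough assumptions, not measured values.
SAMPLE_RATE = 5000                       # samples/sec from the accelerometer
payload = len('{"value": 0.123456}')     # ~19 bytes of actual JSON
overhead = 40 + 2 + 30                   # IP/TCP headers + MQTT fixed header + topic (approx.)

per_sample_Bps = SAMPLE_RATE * (payload + overhead)
batched_Bps = 5 * (60 + overhead)        # 5 summary packets/sec of ~60 bytes each

print(f"overhead fraction per packet: {overhead / (payload + overhead):.0%}")
print(f"per-sample: {per_sample_Bps:,} B/s vs batched: {batched_Bps} B/s")
```

&lt;p&gt;Even with generous assumptions, most of every per-sample packet is framing rather than data, and batching wins by more than two orders of magnitude.&lt;/p&gt;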

&lt;p&gt;I realized that the gap between the sensor and the dashboard isn't a straight line. It's a funnel. You have to aggressively reduce the data volume at the edge before it ever touches the network.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Actual Solution: The Edge-Heavy Pipeline
&lt;/h3&gt;

&lt;p&gt;To make this work, I shifted the intelligence to the edge. The goal is to move from "streaming raw samples" to "streaming features." Instead of sending every single point, I calculate the RMS (Root Mean Square), Peak-to-Peak, and FFT (Fast Fourier Transform) bins locally.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Signal Conditioning and Edge Processing
&lt;/h4&gt;

&lt;p&gt;I moved the processing to a dedicated edge gateway. I used a Python-based service that buffers samples in memory, applies a digital filter to remove electrical noise, and calculates the metrics.&lt;/p&gt;

&lt;p&gt;Here is the implementation of the signal conditioning and feature extraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.signal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;butter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filtfilt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;paho.mqtt.client&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mqtt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration for a 10kHz sampling rate
&lt;/span&gt;&lt;span class="n"&gt;FS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; 
&lt;span class="n"&gt;CUTOFF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="c1"&gt;# Remove noise above 2kHz
&lt;/span&gt;&lt;span class="n"&gt;ORDER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;butter_lowpass_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;nyq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;
    &lt;span class="n"&gt;normal_cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;nyq&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;butter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normal_cutoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;btype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analog&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;filtfilt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Filter the raw signal to remove high-frequency noise
&lt;/span&gt;    &lt;span class="n"&gt;filtered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;butter_lowpass_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CUTOFF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ORDER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate RMS - the primary indicator of overall vibration level
&lt;/span&gt;    &lt;span class="n"&gt;rms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate Peak-to-Peak
&lt;/span&gt;    &lt;span class="n"&gt;ptp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ptp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform FFT to find the dominant frequency
&lt;/span&gt;    &lt;span class="n"&gt;fft_vals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfftfreq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;FS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dominant_freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;freqs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fft_vals&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rms&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ptp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dom_freq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dominant_freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Main loop: Buffer 1000 samples, then send 1 summary packet
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mqtt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mqtt-broker.example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1883&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_sensor_raw&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Mock function for ADC read
&lt;/span&gt;    &lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Send summary instead of 1000 raw points
&lt;/span&gt;        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iiot/machine1/vibration/features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="c1"&gt;# Clear buffer
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. The Transport Layer (MQTT 5.0)
&lt;/h4&gt;

&lt;p&gt;For the broker, I shifted from a basic Mosquitto setup to a more controlled configuration. Since vibration data is critical for predictive maintenance, I needed to ensure that the "heartbeat" of the machine was always known.&lt;/p&gt;

&lt;p&gt;I used MQTT Last Will and Testament (LWT) messages to detect if a gateway went offline (LWT has been part of MQTT since 3.1.1; MQTT 5.0 adds a configurable will delay). If the gateway crashes, the broker publishes a "disconnected" status to the health topic on its behalf, so the dashboard doesn't just show a flat line (which could be mistaken for a stopped machine).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# mosquitto.conf snippet&lt;/span&gt;
&lt;span class="s"&gt;listener &lt;/span&gt;&lt;span class="m"&gt;1883&lt;/span&gt;
&lt;span class="s"&gt;allow_anonymous &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="s"&gt;password_file /etc/mosquitto/passwd&lt;/span&gt;
&lt;span class="c1"&gt;# Prevent the broker from being overwhelmed by slow consumers&lt;/span&gt;
&lt;span class="s"&gt;max_queued_messages &lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've written more about choosing the right broker in my &lt;a href="https://guatulabs.dev/posts/mqtt-broker-selection-hivemq-vs-mosquitto-for-industrial-use/" rel="noopener noreferrer"&gt;MQTT Broker Selection&lt;/a&gt; post, but for vibration, the priority is low latency and high reliability over massive scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Storage and Visualization
&lt;/h4&gt;

&lt;p&gt;I used InfluxDB 2.x for storage because of its native handling of time-series data. Instead of storing the raw waveform, I store the calculated features. This reduces the storage requirement by 1000x.&lt;/p&gt;

&lt;p&gt;In Grafana, I set up a dashboard that monitors the RMS value. However, looking at a raw line graph of vibration is usually useless for operators. They don't know if 0.5g is "bad" or "normal." &lt;/p&gt;

&lt;p&gt;I integrated this with a health scoring system. I used a Flux query in InfluxDB to compare the current RMS against a baseline (the average of the last 7 days).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;InfluxDB&lt;/span&gt; &lt;span class="n"&gt;Flux&lt;/span&gt; &lt;span class="n"&gt;Query&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;Relative&lt;/span&gt; &lt;span class="n"&gt;Vibration&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"iiot_data"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"_measurement"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nv"&gt;"vibration_sensor"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"_field"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nv"&gt;"rms"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;aggregateWindow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;every&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;|&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_value&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;Normalize&lt;/span&gt; &lt;span class="n"&gt;against&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="k"&gt;g&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feeds directly into the concept of &lt;a href="https://guatulabs.dev/posts/equipment-health-scoring-one-number-your-operators-actually-check/" rel="noopener noreferrer"&gt;Equipment Health Scoring&lt;/a&gt;, where the goal is to give the operator a single "Health %" rather than a complex spectrum analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this architecture works
&lt;/h3&gt;

&lt;p&gt;The reason this works is that it respects the laws of physics and networking. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Nyquist-Shannon theorem&lt;/strong&gt; tells us we need to sample at more than twice the highest frequency we want to capture. If you want to detect a bearing fault signature at 2kHz, you must sample above 4kHz. Trying to stream that over WiFi or Ethernet as one JSON-over-MQTT message per sample is impractical: the per-packet overhead eats the throughput.&lt;/p&gt;
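&lt;p&gt;A quick way to see aliasing in action (a synthetic numpy sketch, not data from the rig): sample a 2kHz tone above and below the Nyquist rate and check where the FFT peak lands.&lt;/p&gt;

```python
import numpy as np

def dominant_freq(tone_hz, fs, duration=1.0):
    """Sample a pure tone at rate fs and return the frequency of the FFT peak."""
    n = int(fs * duration)
    t = np.arange(n) / fs
    x = np.sin(2 * np.pi * tone_hz * t)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    return freqs[np.argmax(spectrum)]

# Sampled fast enough (10kHz > 2 x 2kHz): the peak lands where it should.
print(dominant_freq(2000, fs=10000))  # ~2000.0 Hz
# Undersampled (3kHz < 2 x 2kHz): the tone folds down to fs - f = 1000 Hz,
# an alias indistinguishable from a real 1kHz vibration.
print(dominant_freq(2000, fs=3000))   # ~1000.0 Hz
```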

&lt;p&gt;By calculating the RMS and FFT at the edge, we are performing &lt;strong&gt;Data Reduction&lt;/strong&gt;. We transform a high-bandwidth signal (time domain) into a low-bandwidth set of descriptors (frequency domain). &lt;/p&gt;
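&lt;p&gt;The reduction factor is easy to quantify (sizes below are assumptions for a float64 stream carrying the three features from the edge script):&lt;/p&gt;

```python
FS = 10_000           # samples per second
SAMPLE_BYTES = 8      # one float64 per raw sample
N_FEATURES = 3        # rms, ptp, dom_freq
WINDOW = 1000         # samples summarized into each feature packet

raw_Bps = FS * SAMPLE_BYTES                               # 80,000 B/s of raw waveform
feature_Bps = (FS // WINDOW) * N_FEATURES * SAMPLE_BYTES  # 240 B/s of features
print(f"reduction: {raw_Bps / feature_Bps:.0f}x before transport overhead")
```

&lt;p&gt;Each 1000-sample window collapses to a single record; in raw bytes, with three feature fields per record, that is still a ~333x reduction before MQTT and TCP framing widen the gap further.&lt;/p&gt;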

&lt;p&gt;The edge processing also cleans the signal itself. The Butterworth low-pass filter strips the high-frequency switching spikes from the VFDs; the 60Hz mains hum sits well below the 2kHz cutoff, so removing it takes a separate notch (band-stop) filter at 60Hz. Either way, if you defer this to the cloud, you've already wasted the bandwidth sending noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons learned and caveats
&lt;/h3&gt;

&lt;p&gt;If I had to build this again, I'd change a few things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Hardware-level filtering:&lt;/strong&gt; I spent too much time in Python trying to fix signal noise. In a real industrial environment, you should use an analog anti-aliasing filter (a physical capacitor/resistor circuit) before the signal ever hits the ADC. Software filters are great, but they can't fix aliasing if the signal was already corrupted during sampling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The "Buffer" Trap:&lt;/strong&gt; My Python script used a simple list for the buffer. At very high sampling rates, Python's list appending becomes slow. I had to switch to &lt;code&gt;numpy&lt;/code&gt; arrays with pre-allocated memory to avoid garbage collection pauses that caused gaps in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Provisioning the Edge:&lt;/strong&gt; Managing these Python scripts across five different gateways was a nightmare. I eventually moved the deployment to a GitOps flow, using &lt;a href="https://guatulabs.dev/posts/automating-infrastructure-with-opentofu-and-github-actions/" rel="noopener noreferrer"&gt;OpenTofu and GitHub Actions&lt;/a&gt; to manage the underlying VM configurations on my Proxmox cluster, ensuring every gateway had the exact same version of &lt;code&gt;scipy&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Dashboard Paradox:&lt;/strong&gt; The more data I put on the dashboard, the less the operators used it. The final version of the system only shows three things: a Green/Yellow/Red light for health, the current RMS value, and a "Time to Maintenance" estimate. Everything else (the FFT bins, the raw waveforms) is hidden in a "Deep Dive" tab that only the reliability engineer ever opens.&lt;/p&gt;

&lt;p&gt;Vibration monitoring is a classic example of where "more data" is actually "less information." The value isn't in the sensor; it's in the reduction process that happens between the sensor and the screen.&lt;/p&gt;

</description>
      <category>iiot</category>
      <category>vibrationanalysis</category>
      <category>mqtt</category>
      <category>influxdb</category>
    </item>
    <item>
      <title>Unprivileged LXC + Docker: The runc Sysctl Permission Trap</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Tue, 05 May 2026 00:15:20 +0000</pubDate>
      <link>https://dev.to/futhgar/unprivileged-lxc-docker-the-runc-sysctl-permission-trap-fb5</link>
      <guid>https://dev.to/futhgar/unprivileged-lxc-docker-the-runc-sysctl-permission-trap-fb5</guid>
      <description>&lt;p&gt;&lt;code&gt;sysctl: setting key "net.ipv4.ip_local_port_range": Permission denied&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I saw this error while trying to tune the network stack for a high-concurrency service running in Docker, which itself was hosted inside an unprivileged LXC container on Proxmox. The weird part? I was root inside the container.&lt;/p&gt;

&lt;p&gt;I expected that since I had already enabled &lt;code&gt;nesting=1&lt;/code&gt; and &lt;code&gt;keyctl=1&lt;/code&gt; in the LXC configuration, Docker would have the necessary permissions to modify kernel parameters via &lt;code&gt;runc&lt;/code&gt;. In a standard VM, this is trivial. In a privileged container, it just works. But in an unprivileged container, the user namespace mapping creates a wall that &lt;code&gt;runc&lt;/code&gt; cannot climb.&lt;/p&gt;

&lt;p&gt;What actually happened is a collision between &lt;code&gt;systemd&lt;/code&gt; (v243+), &lt;code&gt;runc&lt;/code&gt;, and the Linux kernel's security model for unprivileged user namespaces. When you run an unprivileged LXC, the root user inside the container is actually a non-privileged user on the Proxmox host (usually UID 100000). &lt;/p&gt;

&lt;p&gt;The kernel prevents these mapped users from modifying &lt;code&gt;sysctl&lt;/code&gt; settings because those settings are often global or namespace-specific in ways that could allow a container to crash the host or leak information. &lt;code&gt;runc&lt;/code&gt;, the runtime Docker uses, tries to apply these settings during container creation, but the kernel returns a permission denied error. Because of how some Docker versions handle this, the error is sometimes swallowed, and your app just runs with the wrong defaults.&lt;/p&gt;

&lt;p&gt;If you're building a production-grade homelab, you probably don't want to just switch to a privileged container. That's a security nightmare. Instead, you have to move the configuration "up" the chain.&lt;/p&gt;

&lt;p&gt;The fix is to apply the &lt;code&gt;sysctl&lt;/code&gt; settings at the LXC level before the container fully initializes, or directly on the host if the parameter isn't namespaced. Since we want to keep the host clean, using an LXC pre-start hook is the cleanest way to inject these settings.&lt;/p&gt;

&lt;p&gt;On the Proxmox host, you can add a hook to the container's configuration file (usually in &lt;code&gt;/etc/pve/lxc/ID.conf&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add this to your LXC .conf file on the Proxmox host&lt;/span&gt;
lxc.hook.pre-start &lt;span class="o"&gt;=&lt;/span&gt; /usr/bin/echo &lt;span class="s2"&gt;"net.ipv4.ip_local_port_range = 1024 65535"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/sysctl.d/99-lxc.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, for most users, the most reliable method is to define the parameter in the host's &lt;code&gt;sysctl.conf&lt;/code&gt; if it's a global setting, or use the &lt;code&gt;lxc.sysctl&lt;/code&gt; directive in the config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example Proxmox LXC config snippet&lt;/span&gt;
&lt;span class="na"&gt;arch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;amd64&lt;/span&gt;
&lt;span class="na"&gt;cores&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;
&lt;span class="na"&gt;net0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;name=eth0,bridge=vmbr0,ip=10.0.0.x/24,gw=10.0.0.1&lt;/span&gt;
&lt;span class="na"&gt;ostype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu&lt;/span&gt;
&lt;span class="na"&gt;unprivileged&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;features&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nesting=1,keyctl=1&lt;/span&gt;
&lt;span class="c1"&gt;# Inject the sysctl here&lt;/span&gt;
&lt;span class="s"&gt;lxc.sysctl.net.ipv4.ip_local_port_range = 1024 &lt;/span&gt;&lt;span class="m"&gt;65535&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After adding this, you have to restart the container. If you just restart the Docker daemon inside the LXC, the kernel parameter won't update because the LXC boundary is where the restriction lives.&lt;/p&gt;

&lt;p&gt;This trap is common when you're trying to optimize networking or memory management (like &lt;code&gt;vm.max_map_count&lt;/code&gt; for Elasticsearch) inside a nested environment. If you've dealt with the headache of &lt;a href="https://guatulabs.dev/posts/gpu-passthrough-on-proxmox-gotcha-guide/" rel="noopener noreferrer"&gt;GPU passthrough on Proxmox&lt;/a&gt;, you know that the gap between "it's a container" and "it's an unprivileged container" is where most of the pain lives.&lt;/p&gt;

&lt;p&gt;One last thing to watch out for: UID shifts. If you're mounting NFS shares into these containers to provide storage for your Docker volumes, you'll hit the UID mismatch. The container thinks it's root (UID 0), but the host sees UID 100000. I've spent hours debugging "Permission Denied" on volumes only to realize I needed to &lt;code&gt;chmod 0777&lt;/code&gt; the host directory or properly map the IDs in the &lt;code&gt;.conf&lt;/code&gt; file.&lt;/p&gt;
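&lt;p&gt;For reference, this is what the default mapping looks like written out in the container config (the 100000 offset matches stock Proxmox &lt;code&gt;/etc/subuid&lt;/code&gt; ranges; splitting the range to pass a specific UID through unchanged also requires matching &lt;code&gt;subuid&lt;/code&gt;/&lt;code&gt;subgid&lt;/code&gt; entries on the host):&lt;/p&gt;

```
# /etc/pve/lxc/ID.conf -- the default unprivileged mapping, written out explicitly
# "u 0 100000 65536": container UIDs 0-65535 appear on the host as 100000-165535
lxc.idmap: u 0 100000 65536
lxc.idmap: g 0 100000 65536
```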

&lt;p&gt;If you're scaling this into a larger cluster, I highly recommend moving these workloads to bare-metal Kubernetes. I wrote about my experience with &lt;a href="https://guatulabs.dev/posts/kubernetes-storage-on-bare-metal-longhorn-in-practice/" rel="noopener noreferrer"&gt;Longhorn for bare-metal storage&lt;/a&gt;, and while the initial setup is heavier than an LXC, you stop fighting the Proxmox container permission war and start dealing with standard K8s primitives.&lt;/p&gt;

</description>
      <category>proxmox</category>
      <category>lxc</category>
      <category>docker</category>
      <category>sysctl</category>
    </item>
    <item>
      <title>AdGuard Home: Network-Wide DNS Filtering with Failover</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Mon, 04 May 2026 22:15:20 +0000</pubDate>
      <link>https://dev.to/futhgar/adguard-home-network-wide-dns-filtering-with-failover-1i0</link>
      <guid>https://dev.to/futhgar/adguard-home-network-wide-dns-filtering-with-failover-1i0</guid>
      <description>&lt;p&gt;DNS is the single point of failure that makes everyone in the house complain that "the internet is down" when, in reality, your DNS container just crashed. I've spent too much time as the sole admin of my network having to manually flip DNS settings on my router because a single AdGuard Home instance decided to stop responding. If you're running this in a homelab, you can't just set it and forget it. You need a failover strategy that doesn't require you to touch a CLI while your family is staring at you.&lt;/p&gt;

&lt;p&gt;The mistake most people make is trusting the default upstream behavior. They add three upstream servers and assume AdGuard Home will magically route around a dead one instantly. In practice, depending on your version and config, you can still hit timeouts that feel like a total outage. I've moved my setup to a Kubernetes deployment using MetalLB to give it a static IP, but the real win is the explicit failover logic in the &lt;code&gt;AdGuardHome.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I prefer using a combination of Cloudflare and Quad9 for the primary upstreams, with a dedicated fallback. This ensures that if my primary DNS providers have a routing issue, the system pivots to a tertiary option without dropping the request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# adguard-home.yaml snippets&lt;/span&gt;
&lt;span class="na"&gt;upstream_dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.1.1.1"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0.1"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9.9.9.9"&lt;/span&gt;

&lt;span class="na"&gt;dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Use parallel requests to find the fastest response&lt;/span&gt;
  &lt;span class="na"&gt;upstream_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;parallel&lt;/span&gt; 

&lt;span class="na"&gt;failover&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;health_check_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;health_check_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;fallback_upstream&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8.8.8.8"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For those running this on K8s, don't skimp on memory limits. I initially set my memory limit too low and watched the OOM killer terminate the pod every time I updated a large blocklist. I now pin my resources to ensure stability, especially when integrated with &lt;a href="https://guatulabs.dev/posts/cert-manager-cloudflare-dns-01-automated-tls-for-everything/" rel="noopener noreferrer"&gt;cert-manager for automated TLS&lt;/a&gt; to secure the dashboard.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;adguard-home k8s-at-home/adguard-home &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; network &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; image.tag&lt;span class="o"&gt;=&lt;/span&gt;latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; resources.limits.memory&lt;span class="o"&gt;=&lt;/span&gt;1Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; resources.requests.memory&lt;span class="o"&gt;=&lt;/span&gt;256Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The biggest lesson here is that "high availability" for DNS isn't just about having two pods. It's about how the system handles the gap between a server being "up" and a server actually returning a valid record. If you're building out larger infrastructure, I've found that combining this with a strict &lt;a href="https://guatulabs.dev/posts/kubernetes-manifest-validation-catching-errors-before-merge/" rel="noopener noreferrer"&gt;manifest validation pipeline&lt;/a&gt; prevents the kind of YAML typos that can take your entire network offline.&lt;/p&gt;
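&lt;p&gt;To make that gap measurable, here's a minimal sketch (my own, not part of AdGuard Home) of a probe that inspects the raw DNS reply instead of just checking that the port answers. The 12-byte header layout is standard DNS; the classification labels and the &lt;code&gt;classify_dns_response&lt;/code&gt; name are mine.&lt;/p&gt;

```python
# Hypothetical helper: classify a raw DNS reply so that "server is up"
# and "server returned a usable record" are different health states.
import struct

def classify_dns_response(packet: bytes) -> str:
    # Anything shorter than the 12-byte DNS header is a dead or truncated reply
    if len(packet) >= 12:
        flags, qdcount, ancount = struct.unpack("!HHH", packet[2:8])
        rcode = flags % 16              # response code lives in the low 4 bits
        if rcode == 0 and ancount > 0:
            return "healthy"            # NOERROR and at least one answer record
        if rcode == 0:
            return "no-answer"          # the gap: "up" but returning nothing useful
        return "error"                  # SERVFAIL, NXDOMAIN, REFUSED, ...
    return "dead"

# A NOERROR reply with one answer counts as healthy; a SERVFAIL does not.
healthy = struct.pack("!HHHHHH", 0, 0x8180, 1, 1, 0, 0)
servfail = struct.pack("!HHHHHH", 0, 0x8182, 1, 0, 0, 0)
print(classify_dns_response(healthy), classify_dns_response(servfail))
```

&lt;p&gt;A health check built on this marks an upstream unhealthy on &lt;code&gt;SERVFAIL&lt;/code&gt; or empty answers, not just on connection refusal.&lt;/p&gt;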

&lt;p&gt;Keep your upstreams diverse and your memory limits realistic.&lt;/p&gt;

</description>
      <category>dns</category>
      <category>adguardhome</category>
      <category>infrastructure</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Three-Layer Safety for Autonomous Agents: Stopping the Infinite Loop</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Thu, 30 Apr 2026 22:15:29 +0000</pubDate>
      <link>https://dev.to/futhgar/three-layer-safety-for-autonomous-agents-stopping-the-infinite-loop-3go5</link>
      <guid>https://dev.to/futhgar/three-layer-safety-for-autonomous-agents-stopping-the-infinite-loop-3go5</guid>
      <description>&lt;p&gt;I watched an autonomous agent spend three hours and 40,000 tokens trying to close a GitHub issue that had an open dependency, only to fail because it kept hallucinating a &lt;code&gt;force_close&lt;/code&gt; flag that didn't exist in the API. It didn't just fail; it entered a perfect infinite loop: it would call the tool, get a 400 error, interpret the error as a "temporary network glitch," and try again with the exact same payload.&lt;/p&gt;

&lt;p&gt;If you've built agents that actually touch production systems, you know this feeling. Prompting the agent to "be careful" or "follow the schema" is a placebo. When you move from a chat window to an autonomous loop, the gap between the LLM's intent and the system's reality becomes a canyon where agents go to die (and burn through your API credits).&lt;/p&gt;

&lt;p&gt;For anyone running agent orchestration in a homelab or production environment, you need a safety architecture that doesn't rely on the model's "good behavior." I've moved to a three-layer safety model: Token-Level Enforcement, Pre-Execution Gates, and Execution Isolation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tried first
&lt;/h2&gt;

&lt;p&gt;My first instinct was to lean heavily on PydanticAI. The idea of using Pydantic for type-safe tool calling seemed like the silver bullet. I spent a week building out complex schemas, thinking that if the code validated the output, the agent would simply "learn" to provide the correct format.&lt;/p&gt;

&lt;p&gt;I was wrong. I hit a wall where the agent would produce a JSON object that was &lt;em&gt;almost&lt;/em&gt; correct, but it would miss a closing brace or add a trailing comma. Pydantic would throw a &lt;code&gt;ValidationError&lt;/code&gt;, the agent would see that error in its history, and then it would attempt to "fix" the JSON by adding even more commentary around the code block. This created a feedback loop of &lt;code&gt;ValidationError&lt;/code&gt; → &lt;code&gt;Apology&lt;/code&gt; → &lt;code&gt;Broken JSON&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then I tried adding a "supervisor" agent to review the actions of the "worker" agent. This just doubled my latency and doubled my token cost without actually solving the root cause. The supervisor often hallucinated the same API capabilities as the worker because they were using the same base model.&lt;/p&gt;

&lt;p&gt;The real problem wasn't the logic; it was the lack of deterministic boundaries. I was treating the LLM as a reliable software component when it's actually a probabilistic engine. To make it safe, I had to stop trying to "convince" the model to be safe and start forcing it to be safe at the infrastructure level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Token-Level Schema Enforcement
&lt;/h2&gt;

&lt;p&gt;The first layer of safety happens before the agent even finishes its sentence. If you're using Ollama v0.5.0 or newer, you can stop relying on the model to "try its best" with JSON.&lt;/p&gt;

&lt;p&gt;Most people use the OpenAI-compatible API layer provided by frameworks, but that often just wraps the prompt in "Please return JSON." Ollama now supports a native &lt;code&gt;format&lt;/code&gt; parameter that enforces the schema at the token-sampling level. This means the model physically cannot sample a token that violates the JSON schema.&lt;/p&gt;

&lt;p&gt;Here is how I implemented this for my homelab health reports using &lt;code&gt;qwen2.5:14b-instruct&lt;/code&gt;. I switched from the 32B model to the 14B variant because the 32B was causing 502 timeouts on my Tesla P40s due to VRAM pressure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="c1"&gt;# Define the strict structure we want
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HomelabHealthReport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;node_status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;critical_alerts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;storage_utilization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Percentage 0-100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Extract the JSON schema for Ollama
&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HomelabHealthReport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_json_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_safe_report&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# We bypass the high-level wrappers and hit the API directly
&lt;/span&gt;    &lt;span class="c1"&gt;# to ensure the 'format' parameter is actually passed.
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://ollama:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5:14b-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# This is the magic: token-level enforcement
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a health report for the homelab based on current metrics.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Result is guaranteed to be valid JSON matching HomelabHealthReport
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By moving the constraint to the sampler, I eliminated the &lt;code&gt;ValidationError&lt;/code&gt; loops entirely. The model no longer "guesses" the JSON; it is constrained by the grammar of the schema.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: The Pre-Execution Gate (ActionGate)
&lt;/h2&gt;

&lt;p&gt;Even with perfect JSON, an agent can still decide to do something stupid. Token-level safety ensures the &lt;em&gt;format&lt;/em&gt; is right, but it doesn't ensure the &lt;em&gt;intent&lt;/em&gt; is safe.&lt;/p&gt;

&lt;p&gt;I implemented an &lt;code&gt;ActionGate&lt;/code&gt;. This is a deterministic middleware layer that sits between the agent's tool-call and the actual execution. It doesn't use an LLM. It uses hard-coded business logic and state checks.&lt;/p&gt;

&lt;p&gt;If an agent tries to close a ticket, the &lt;code&gt;ActionGate&lt;/code&gt; checks if there are open dependencies. If it tries to reboot a node, it checks if that node is currently the only one running a critical service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SafetyException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_action_safety&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Deterministic safety check. 
    No LLMs allowed here.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Prevent closing issues that have blocking dependencies
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;close_issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issue_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;issue_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_has_dependency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SafetyException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Safety Violation: Cannot close issue &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;issue_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; while dependencies are open.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Prevent destructive actions on production nodes during peak hours
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reboot_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;peak_hours&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SafetyException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Safety Violation: Reboot of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; forbidden during peak hours.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="c1"&gt;# Usage in the agent loop
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;check_action_safety&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;SafetyException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# We feed the specific error back to the agent so it can pivot
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action rejected by Safety Gate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents the "infinite loop of failure" I mentioned earlier. Instead of the agent getting a generic 400 error from an API and thinking it's a network glitch, it gets a clear, human-readable explanation: "You cannot do this because X." This forces the agent to change its strategy rather than just retrying the same failed request.&lt;/p&gt;
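&lt;p&gt;The same layer can also cap retries deterministically. Here's a hedged sketch (&lt;code&gt;RetryGuard&lt;/code&gt; and &lt;code&gt;should_abort&lt;/code&gt; are names I made up, not part of any framework) that hashes each failed tool call and aborts the loop once the agent replays an identical payload too many times:&lt;/p&gt;

```python
# Hypothetical guard: abort the loop once the agent replays the same
# failed tool call too many times, instead of burning tokens forever.
import hashlib
import json

class RetryGuard:
    def __init__(self, max_identical: int = 3):
        self.max_identical = max_identical
        self.failures: dict[str, int] = {}

    def _key(self, tool_name: str, params: dict) -> str:
        # Canonical hash of (tool, payload) so re-ordered keys still match
        payload = json.dumps({"tool": tool_name, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def should_abort(self, tool_name: str, params: dict) -> bool:
        """Record a failure; return True once the identical-retry budget is spent."""
        key = self._key(tool_name, params)
        self.failures[key] = self.failures.get(key, 0) + 1
        return self.failures[key] >= self.max_identical

guard = RetryGuard(max_identical=3)
for attempt in range(3):
    aborted = guard.should_abort("close_issue", {"issue_id": 42})
print(aborted)  # the third identical failure trips the guard
```

&lt;p&gt;A changed payload gets a fresh budget, so the agent is only stopped when it is genuinely stuck, not when it pivots.&lt;/p&gt;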

&lt;h2&gt;
  
  
  Layer 3: Execution Isolation and Shell Safety
&lt;/h2&gt;

&lt;p&gt;The final layer is where the rubber meets the road. I've spent too many hours debugging "quoting hell." &lt;/p&gt;

&lt;p&gt;When you have an agent generating a command that needs to run over SSH, inside a Proxmox container (&lt;code&gt;pct exec&lt;/code&gt;), as a specific user (&lt;code&gt;su&lt;/code&gt;), and then executing a Python script, you have four layers of shell interpretation. If you use f-strings to build these commands, a single single-quote in the agent's output will break the entire pipeline.&lt;/p&gt;

&lt;p&gt;I saw this happen when an agent tried to pass a complex JSON string as an argument to a script. The shell interpreted the quotes, the &lt;code&gt;su&lt;/code&gt; command stripped another layer, and by the time it hit Python, the syntax was mangled.&lt;/p&gt;

&lt;p&gt;The fix is to stop passing code as shell arguments. Instead, pipe the code directly into the stdin of the remote process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wrong way (prone to quoting errors):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This will break the moment the agent adds a ' or " to the payload&lt;/span&gt;
ssh node-a &lt;span class="s2"&gt;"pct exec 101 -- su - user -c 'python3 -c &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;print(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Hello World&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;)&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;'"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The right way (Shell-safe piping):&lt;/strong&gt;&lt;br&gt;
I wrote a helper that writes the agent's intended Python logic to a temporary file or pipes it directly. This avoids the shell's interpretation of the string entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# We pipe the actual script content into the remote shell&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; ~/bin/helpers/scout-ideas-helper.py | &lt;span class="se"&gt;\&lt;/span&gt;
  ssh node-a &lt;span class="s2"&gt;"pct exec 101 -- su - user -c 'python3 -'"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this setup, &lt;code&gt;python3 -&lt;/code&gt; tells Python to execute the code coming from stdin. The shell only sees the command to start Python, not the code itself. This completely eliminates the quoting nightmare.&lt;/p&gt;
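&lt;p&gt;If your orchestrator launches this pipeline from Python, the same stdin trick applies via &lt;code&gt;subprocess&lt;/code&gt;. This is a local-only sketch (the &lt;code&gt;ssh&lt;/code&gt;/&lt;code&gt;pct&lt;/code&gt; layers are omitted, and &lt;code&gt;run_remote_python&lt;/code&gt; is my name for it): the generated code travels as data on stdin and is never interpolated into an argv string.&lt;/p&gt;

```python
# Hypothetical wrapper: the agent's script travels over stdin, so quotes
# in its output cannot break any layer of shell interpretation.
import subprocess

def run_remote_python(script: str) -> str:
    proc = subprocess.run(
        ["python3", "-"],      # fixed argv: the shell never sees the code
        input=script,
        capture_output=True,
        text=True,
        timeout=30,
    )
    if proc.returncode != 0:
        return f"stderr: {proc.stderr.strip()}"
    return proc.stdout.strip()

# A payload containing quotes that would wreck an f-string-built command:
print(run_remote_python('print("quoting is no longer my problem")'))
```

&lt;p&gt;Because the argv list is constant, there is nothing for the agent's output to escape out of.&lt;/p&gt;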

&lt;p&gt;To manage the tools themselves, I've moved away from custom boilerplate and started using FastMCP. It allows me to wrap my MSAM (Multi-Agent System Architecture) tools into a standardized server that the agents can discover and use without me having to manually update the tool definitions every time I add a new function. I've detailed the setup for this in my post on &lt;a href="https://guatulabs.dev/posts/building-mcp-servers-with-fastmcp/" rel="noopener noreferrer"&gt;Building MCP Servers with FastMCP&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works
&lt;/h2&gt;

&lt;p&gt;This architecture works because it acknowledges that the LLM is the most unreliable part of the system. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token-level enforcement&lt;/strong&gt; removes the "formatting" problem. The agent can no longer fail because it forgot a comma.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The ActionGate&lt;/strong&gt; removes the "logic" problem. The agent can no longer perform an action that is fundamentally unsafe, regardless of how confident it is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Isolation&lt;/strong&gt; removes the "infrastructure" problem. The agent's output is treated as data (stdin) rather than as a command (shell argument).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When you combine these, you move from a system that is "mostly working" to one that is "predictably bounded."&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;The biggest surprise was how much the &lt;code&gt;format&lt;/code&gt; parameter in Ollama reduced the need for complex prompt engineering. I spent weeks refining a "System Prompt" to ensure JSON compliance, only to find that a single API parameter did the job better than 500 words of instructions.&lt;/p&gt;

&lt;p&gt;If I were to do this over again, I would have implemented the &lt;code&gt;ActionGate&lt;/code&gt; much sooner. I spent too much time trying to make the agent "smarter" when I should have just made the environment "stricter."&lt;/p&gt;

&lt;p&gt;A few caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Each layer adds a small amount of overhead. The &lt;code&gt;ActionGate&lt;/code&gt; is negligible (milliseconds), but the token-level enforcement can slightly increase the time to first token because the sampler has to do more work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM&lt;/strong&gt;: As I noted, model size matters. Qwen 2.5 14B is the sweet spot for my hardware. If you're running on limited VRAM, don't chase the 32B or 70B models just for the sake of "intelligence" if it leads to 502 timeouts and unstable inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Drift&lt;/strong&gt;: Ensure your agent's memory is cleaned up. I use a &lt;a href="https://guatulabs.dev/posts/six-layer-memory-architecture-for-claude-code/" rel="noopener noreferrer"&gt;six-layer memory architecture&lt;/a&gt; to prevent the agent from getting confused by outdated context, which is often the root cause of why it tries to perform unsafe actions in the first place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building autonomous agents isn't about finding the perfect model; it's about building the perfect cage for that model to operate in.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>llmops</category>
      <category>mcpservers</category>
      <category>ollama</category>
    </item>
    <item>
      <title>Stop Merging Broken YAML: Kubernetes Manifest Validation in CI</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Sat, 25 Apr 2026 22:15:35 +0000</pubDate>
      <link>https://dev.to/futhgar/stop-merging-broken-yaml-kubernetes-manifest-validation-in-ci-52g9</link>
      <guid>https://dev.to/futhgar/stop-merging-broken-yaml-kubernetes-manifest-validation-in-ci-52g9</guid>
      <description>&lt;p&gt;Pushing a broken manifest to your main branch is a rite of passage, but it's one that becomes significantly more painful when you're running a GitOps workflow with ArgoCD. I've spent far too many late nights staring at a "Sync Failed" status in ArgoCD, only to realize I had a typo in a Traefik IngressRoute or a missing resource limit that Kyverno was blocking. The problem isn't just the error itself; it's the feedback loop. If the error only surfaces during deployment, your CI pipeline has failed its primary job.&lt;/p&gt;

&lt;p&gt;The goal is to move validation as far left as possible. I started integrating &lt;code&gt;kubeconform&lt;/code&gt; into my GitHub Actions workflow to catch structural errors, like invalid API versions or malformed fields, before the code even reaches a pull request review. However, structural validation is only half the battle. You also have to deal with policy enforcement. I recently ran into a situation where a Kyverno policy enforcing resource limits on all Jobs was breaking my CloudNativePG (CNPG) deployments. The CNPG operator creates Jobs that don't always follow the standard resource pattern, and because the policy was too broad, the cluster refused to provision the primary.&lt;/p&gt;

&lt;p&gt;The fix involves two parts: using &lt;code&gt;kubeconform&lt;/code&gt; for schema validation in CI and using targeted exclusions in your Kyverno policies. For the CI side, you don't need a complex setup. A simple action step can scan your entire manifests directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Action snippet for manifest validation&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validate-manifests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout code&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Validate Kubernetes manifests&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yannh/kubernetes-manifest-validate@v1.11&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;manifests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;kubernetes/workloads/**/*.yaml&lt;/span&gt;
            &lt;span class="s"&gt;kubernetes/infrastructure/**/*.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the cluster side, when you have a legitimate reason to bypass a policy—like the CNPG example—don't just disable the policy globally. Use labels to create an exclusion scope. This keeps your &lt;a href="https://guatulabs.dev/posts/gitops-for-homelabs-argocd-app-of-apps/" rel="noopener noreferrer"&gt;GitOps for Homelabs&lt;/a&gt; workflow clean without sacrificing security for the rest of your workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Policy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;require-resource-limits&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enforce-limits-on-jobs&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
      &lt;span class="c1"&gt;# Exclude CNPG clusters so the operator can manage its own jobs&lt;/span&gt;
      &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cnpg.io/cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
      &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;containers&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;have&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;resource&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;limits&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;defined."&lt;/span&gt;
        &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
                        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Validating at the PR stage catches the "dumb" mistakes, while smart policy exclusions prevent the "smart" tools from breaking your legitimate infrastructure.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>cicd</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>GPU D3cold Power States: How to Brick Your Card Without Trying</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Fri, 24 Apr 2026 18:15:49 +0000</pubDate>
      <link>https://dev.to/futhgar/gpu-d3cold-power-states-how-to-brick-your-card-without-trying-3bnn</link>
      <guid>https://dev.to/futhgar/gpu-d3cold-power-states-how-to-brick-your-card-without-trying-3bnn</guid>
      <description>&lt;p&gt;THE SYMPTOM: My NVIDIA Tesla P40 would stop responding after a VM shutdown. No error messages, just a dead GPU that required a full host reboot to recover.&lt;/p&gt;

&lt;p&gt;WHAT I EXPECTED: A clean shutdown of a VM with GPU passthrough should leave the GPU in a ready state. I assumed the host would handle power states gracefully.&lt;/p&gt;

&lt;p&gt;WHAT ACTUALLY HAPPENED: The GPU went into D3cold, a low-power state that it couldn't exit without a full host reboot. This happened even after proper VM shutdowns. The issue was especially prevalent on Proxmox 8.4 with kernel 6.8.x and QEMU 8.0.1, where the lack of FLR support on the P40 made it impossible to reset the GPU from the host.&lt;/p&gt;

&lt;p&gt;THE FIX: I disabled D3cold before passthrough by writing &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;/sys/bus/pci/devices/0000:08:00.0/d3cold_allowed&lt;/code&gt;. Sysfs writes don’t survive a reboot, so I made the setting persistent with a udev rule that matches the card and disables D3cold whenever the device appears. Here's the rule I used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ACTION&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="s2"&gt;"add"&lt;/span&gt;, &lt;span class="nv"&gt;SUBSYSTEM&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="s2"&gt;"pci"&lt;/span&gt;, ATTR&lt;span class="o"&gt;{&lt;/span&gt;vendor&lt;span class="o"&gt;}==&lt;/span&gt;&lt;span class="s2"&gt;"0x10de"&lt;/span&gt;, ATTR&lt;span class="o"&gt;{&lt;/span&gt;device&lt;span class="o"&gt;}==&lt;/span&gt;&lt;span class="s2"&gt;"0x1b80"&lt;/span&gt;, ATTR&lt;span class="o"&gt;{&lt;/span&gt;bus&lt;span class="o"&gt;}==&lt;/span&gt;&lt;span class="s2"&gt;"0000:08"&lt;/span&gt;, SYMLINK+&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-passthrough"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reapplies the setting on every boot, so the card never drops into the D3cold trap in the first place. For Proxmox 8.4 users, I also had to explicitly set &lt;code&gt;machine: q35&lt;/code&gt; in the VM config to prevent QEMU from asserting on boot.&lt;/p&gt;
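&lt;p&gt;For a one-off test before wiring up the udev rule, you can flip the sysfs knob directly. The PCI address here is my card's; substitute your own, and note this does not persist across reboots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Disable D3cold for the GPU immediately (root required)
echo 0 | sudo tee /sys/bus/pci/devices/0000:08:00.0/d3cold_allowed

# Confirm the setting took; this should print 0
cat /sys/bus/pci/devices/0000:08:00.0/d3cold_allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
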

&lt;p&gt;WHY THIS MATTERS: If you're running non-FLR GPUs like the P40 on Proxmox 8.4 or later, you're likely to hit this issue. It's not just a matter of setting up passthrough — you need to actively prevent the GPU from entering D3cold and lock its PCI address. If you skip either step, you're asking for a bricked GPU. I've seen this happen on more than one occasion, and the fix is always the same: disable D3cold and pin the address.&lt;/p&gt;

&lt;p&gt;This isn't just a Proxmox-specific gotcha. Any system that doesn't support FLR on the GPU and relies on the kernel to manage power states is at risk. If you're using a Tesla P40, T4, or any other non-FLR GPU and you're seeing GPU failures after VM shutdown or reboot, this is the fix you need. I've also seen this issue surface with AMD GPUs under certain conditions, though the fix is slightly different.&lt;/p&gt;

&lt;p&gt;If you're running AI workloads on Kubernetes or any other system that depends on GPU passthrough, this is a critical detail. You don't want to be the one who has to power cycle a node just to get a GPU working again. I've had to do it more than once. It's not fun. The key is to prevent the GPU from ever getting into a state where it can't reset itself.&lt;/p&gt;

&lt;p&gt;For those who are considering moving away from GPU passthrough entirely, I've also found that running the NVIDIA driver directly on the host can be a much more stable option. It avoids all the PCIe bus instability and power state issues. I've tested this with the NVIDIA Container Toolkit and it's worked well for me in production environments.&lt;/p&gt;
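&lt;p&gt;If you go the host-driver route, the setup is roughly this sketch. It assumes Docker plus the NVIDIA Container Toolkit are already installed, and the CUDA image tag is just an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Wire the NVIDIA runtime into Docker's config and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Smoke test: the container should list the host GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
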

&lt;p&gt;I've written this post not because I want to scare you — but because I want to save you from the frustration of a bricked GPU. If you're running Proxmox, using older GPUs, and you've had this issue, you're not alone. I've been there, and I've found a way to avoid it.&lt;/p&gt;

</description>
      <category>gpupassthrough</category>
      <category>d3cold</category>
      <category>proxmox</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>cert-manager + Cloudflare DNS-01: Automated TLS for Everything</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Fri, 24 Apr 2026 04:15:49 +0000</pubDate>
      <link>https://dev.to/futhgar/cert-manager-cloudflare-dns-01-automated-tls-for-everything-2klm</link>
      <guid>https://dev.to/futhgar/cert-manager-cloudflare-dns-01-automated-tls-for-everything-2klm</guid>
      <description>&lt;p&gt;I spent two days chasing a cert-manager error that looked like it was coming from the future. The message was clean: &lt;code&gt;Error: failed to solve challenge: failed to update DNS record: 403 Forbidden&lt;/code&gt;. I had followed the docs, created the API token, set up the ClusterIssuer, and even double-checked the zone. But the error wouldn’t go away. Turns out, the token didn’t have the right scope, and I had no idea. That’s the kind of thing that happens when you skip the part of the documentation that says "make sure your token has these exact permissions."&lt;/p&gt;

&lt;p&gt;If you’re running Kubernetes on bare metal, in a homelab, or in a production environment, you need TLS. You need it for ingress, for internal services, for anything exposed to the internet. cert-manager is the go-to tool for this, and Cloudflare is the go-to DNS provider for a lot of us. But the setup isn’t as straightforward as the docs make it look. I’m going to walk through what I tried first, what actually worked, and why it matters.&lt;/p&gt;

&lt;p&gt;I’m not here to sell you on cert-manager or Cloudflare. I’m here to tell you what I did when I tried to make TLS work for everything in my cluster, and what I had to fix when it didn’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Tried First
&lt;/h2&gt;

&lt;p&gt;I started with the standard cert-manager installation, using the Helm chart. I followed the example for Cloudflare DNS-01, set up the API token, and created the ClusterIssuer. I used a Kubernetes Secret to store the token, and referenced it in the &lt;code&gt;cloudflare&lt;/code&gt; provider block of the issuer configuration.&lt;/p&gt;

&lt;p&gt;That’s where it went wrong.&lt;/p&gt;

&lt;p&gt;The first error was &lt;code&gt;403 Forbidden&lt;/code&gt;, and I had no idea why. I checked the token’s scope again. I double-checked the zone name. I even created a new token with all the permissions I could think of. Nothing worked. The cert-manager logs just said the same thing again and again.&lt;/p&gt;

&lt;p&gt;I tried looking for similar issues on GitHub, Stack Overflow, and the cert-manager forums. The most common answers were things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Make sure your token has the &lt;code&gt;DNS:Edit&lt;/code&gt; scope"&lt;/li&gt;
&lt;li&gt;"Check that the zone name is correct"&lt;/li&gt;
&lt;li&gt;"Ensure that the API token is not expired"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had done all of those. And still, it didn’t work.&lt;/p&gt;

&lt;p&gt;Then I thought: maybe the token was created for the wrong zone. I went to the Cloudflare dashboard, created a new token for the exact zone name I was using. I gave it &lt;code&gt;Zone:Read&lt;/code&gt; and &lt;code&gt;DNS:Edit&lt;/code&gt; permissions, and then I re-deployed the ClusterIssuer.&lt;/p&gt;

&lt;p&gt;Still nothing.&lt;/p&gt;

&lt;p&gt;It was at this point that I realized the issue wasn’t the token or the zone; it was the &lt;code&gt;email&lt;/code&gt; field in the provider configuration. I had used the same email that I used to register the domain. But that field must be the Cloudflare account login email, and it only applies when you authenticate with the legacy Global API Key; with a scoped API token it can be omitted entirely. That’s not something you see called out prominently in the documentation, and that’s exactly the kind of thing that breaks your day.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual Solution
&lt;/h2&gt;

&lt;p&gt;Let’s get specific. Here’s what I ended up with:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create a Cloudflare API Token
&lt;/h3&gt;

&lt;p&gt;Go to the &lt;a href="https://dash.cloudflare.com/profile/api-tokens" rel="noopener noreferrer"&gt;Cloudflare API Tokens dashboard&lt;/a&gt;, and create a new token. Give it the following permissions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zone:Read&lt;/strong&gt; (for reading the zone information)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS:Edit&lt;/strong&gt; (for updating DNS records during the DNS-01 challenge)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make sure the token is scoped to the exact zone you're using, not the entire account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Store the API Token in a Kubernetes Secret
&lt;/h3&gt;

&lt;p&gt;Create a Kubernetes Secret to store your Cloudflare API token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic cloudflare-api-token &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;api-token&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-cloudflare-api-token-here"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cert-manager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure cert-manager with the Cloudflare DNS-01 Provider
&lt;/h3&gt;

&lt;p&gt;Here’s the complete configuration for the &lt;code&gt;ClusterIssuer&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cert-manager.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIssuer&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloudflare&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;acme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-cloudflare-account-email@example.com"&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://acme-v02.api.sandbox.cloudfla.re&lt;/span&gt;
    &lt;span class="na"&gt;privateKeySecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloudflare-acme-account-key&lt;/span&gt;
    &lt;span class="na"&gt;solvers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;dnsZones&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example.com"&lt;/span&gt;
        &lt;span class="na"&gt;dns01&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cloudflare&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;apiTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloudflare-api-token&lt;/span&gt;
              &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;your-cloudflare-account-email@example.com&lt;/code&gt; with a real address. In this token-based setup it is the ACME account contact, which Let’s Encrypt uses for expiry notices. If you switch to the legacy Global API Key instead, the &lt;code&gt;email&lt;/code&gt; field under the &lt;code&gt;cloudflare&lt;/code&gt; provider must be your Cloudflare account email, not the one you used to register the domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Deploy a Sample Ingress to Test
&lt;/h3&gt;

&lt;p&gt;Create a simple Ingress to test if cert-manager can issue a certificate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-ingress&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/ingress.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nginx"&lt;/span&gt;
    &lt;span class="na"&gt;cert-manager.io/cluster-issuer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudflare"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example.com"&lt;/span&gt;
      &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;example-com-tls&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
                &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once this is deployed, cert-manager should automatically request and issue a certificate for &lt;code&gt;example.com&lt;/code&gt;.&lt;/p&gt;
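&lt;p&gt;To confirm it worked, I check the chain of resources cert-manager creates; the certificate name comes from the &lt;code&gt;secretName&lt;/code&gt; in the Ingress above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The Certificate is created automatically from the Ingress annotation
kubectl get certificate example-com-tls

# If READY stays False, walk down the intermediate resources
kubectl describe certificaterequest
kubectl describe order
kubectl describe challenge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
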

&lt;h3&gt;
  
  
  Step 5: Use SealedSecrets for Secure Credential Management (Optional)
&lt;/h3&gt;

&lt;p&gt;If you want to store your Cloudflare API token securely, you can use SealedSecrets. This is especially useful if you're using GitOps or want to ensure that secrets aren't stored in plain text in your version control system.&lt;/p&gt;

&lt;p&gt;First, install SealedSecrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/bitnami-labs/sealed-secrets/releases/latest/download/controller.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, seal your secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubeseal &lt;span class="nt"&gt;--cert&lt;/span&gt; ./sealed-secrets/public-key.pem &amp;lt; cloudflare-api-token.yaml &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; cloudflare-api-token-sealed.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the sealed secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; cloudflare-api-token-sealed.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And update your &lt;code&gt;ClusterIssuer&lt;/code&gt; to reference the sealed secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloudflare-api-token-sealed&lt;/span&gt;
  &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why It Works
&lt;/h2&gt;

&lt;p&gt;cert-manager uses the ACME protocol to issue certificates. When you use the DNS-01 challenge, cert-manager needs to create a DNS TXT record to prove that it controls the domain. This is where Cloudflare comes in: it allows cert-manager to temporarily create that record on your behalf.&lt;/p&gt;
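&lt;p&gt;You can watch this happen from outside the cluster while a certificate is being issued; the record name below assumes your domain is &lt;code&gt;example.com&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The DNS-01 challenge publishes a temporary TXT record
dig TXT _acme-challenge.example.com +short

# Meanwhile, follow what cert-manager is doing
kubectl logs -n cert-manager deploy/cert-manager -f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
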

&lt;p&gt;The key part here is the &lt;code&gt;cloudflare&lt;/code&gt; provider configuration in the &lt;code&gt;ClusterIssuer&lt;/code&gt;. It tells cert-manager how to authenticate with Cloudflare using the API token. With &lt;code&gt;apiTokenSecretRef&lt;/code&gt;, no provider-level email is needed; that field only matters with the legacy Global API Key, where it must be the email associated with the Cloudflare account, not the domain's registration email.&lt;/p&gt;

&lt;p&gt;Cloudflare's API is rate-limited, and if a challenge keeps failing, cert-manager will keep retrying and can chew through that limit, stalling issuance entirely. Getting the token scopes right up front keeps you out of that retry loop.&lt;/p&gt;

&lt;p&gt;Also, cert-manager matches the zone in the &lt;code&gt;selector.dnsZones&lt;/code&gt; field against the domain being validated; listing &lt;code&gt;example.com&lt;/code&gt; covers its subdomains as well. If you want a wildcard certificate like &lt;code&gt;*.example.com&lt;/code&gt;, DNS-01 is the only challenge type that can validate it, so this setup is a prerequisite.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use the right email.&lt;/strong&gt; The Cloudflare account email is not the same as the domain’s registration email, and it only comes into play with the legacy Global API Key. This is the first thing I missed, and it took me hours to figure out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token scopes matter.&lt;/strong&gt; If your token doesn’t have the right permissions, cert-manager can't update the DNS records. Always double-check that your token has &lt;code&gt;Zone:Read&lt;/code&gt; and &lt;code&gt;DNS:Edit&lt;/code&gt; permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure your secrets.&lt;/strong&gt; Using SealedSecrets is a great way to keep your Cloudflare API token safe. It adds an extra layer of security and ensures that your secrets aren't exposed in your version control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test everything.&lt;/strong&gt; Don’t assume that just because the configuration looks right, it will work. Create a test Ingress and watch the cert-manager logs to see what’s happening under the hood.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch for API limits.&lt;/strong&gt; Cloudflare's API has rate limits, and if you’re issuing a lot of certificates, you could hit those limits. It’s a good idea to monitor your usage and set up alerts if needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also found that cert-manager v1.13+ has stricter validation for DNS01 providers. Older versions might not show errors clearly, which can make debugging a lot harder. I ended up using v1.14.0, which had better error messages and worked more reliably with Cloudflare.&lt;/p&gt;

&lt;p&gt;If you're using a dynamic IP setup, like with a home network or a cloud provider that assigns dynamic IPs, you’ll need to set up a DDNS update mechanism. I used a CronJob with GitOps to ensure that my Cloudflare A record was always up to date. That way, even if my IP changes, the certificate can still be issued and renewed.&lt;/p&gt;
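&lt;p&gt;My CronJob is close to this sketch. The schedule, image tag, domain, and the &lt;code&gt;ZONE_ID&lt;/code&gt;/&lt;code&gt;RECORD_ID&lt;/code&gt; values are placeholders you'd fill in from the Cloudflare dashboard; it reuses the same API token secret:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  name: cloudflare-ddns
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: ddns
              image: curlimages/curl:8.7.1
              env:
                - name: API_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: cloudflare-api-token
                      key: api-token
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # ZONE_ID and RECORD_ID come from the Cloudflare dashboard/API
                  IP=$(curl -s https://api.ipify.org)
                  curl -s -X PUT \
                    "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID" \
                    -H "Authorization: Bearer $API_TOKEN" \
                    -H "Content-Type: application/json" \
                    --data "{\"type\":\"A\",\"name\":\"example.com\",\"content\":\"$IP\"}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
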

&lt;p&gt;I also had to deal with my ArgoCD Application pointing at an old cert-manager chart. It had a misconfigured &lt;code&gt;targetRevision&lt;/code&gt;, which was causing the issuer to fail silently. I had to manually update the &lt;code&gt;targetRevision&lt;/code&gt; to match the chart version I actually wanted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Automating TLS with cert-manager and Cloudflare DNS-01 is a powerful combination. It saves you from manually issuing certificates and keeps your cluster secure without the overhead of managing them by hand. But it’s not without its gotchas, and I’ve had my fair share of them.&lt;/p&gt;

&lt;p&gt;If you're new to cert-manager or Cloudflare, I recommend starting small. Create a test Ingress, watch the logs, and make sure everything works before rolling it out to production. And if you hit a roadblock, don’t be afraid to check the cert-manager logs; they often give you the exact error you need to fix the problem.&lt;/p&gt;

&lt;p&gt;You can also find more information on how to set up cert-manager with Cloudflare in the official documentation, or in some of the other posts on guatulabs.dev about Kubernetes and infrastructure. If you're interested in more advanced topics, like setting up a GitOps pipeline with ArgoCD or using SealedSecrets for secure credential management, those are also worth reading.&lt;/p&gt;

&lt;p&gt;In the end, the goal is to make TLS work for everything, and that’s what cert-manager is all about.&lt;/p&gt;

</description>
      <category>certmanager</category>
      <category>cloudflare</category>
      <category>kubernetes</category>
      <category>tls</category>
    </item>
    <item>
      <title>SealedSecrets Key Backup: Don't Lose Your Encryption Keys</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Wed, 22 Apr 2026 22:15:49 +0000</pubDate>
      <link>https://dev.to/futhgar/sealedsecrets-key-backup-dont-lose-your-encryption-keys-18ad</link>
      <guid>https://dev.to/futhgar/sealedsecrets-key-backup-dont-lose-your-encryption-keys-18ad</guid>
      <description>&lt;p&gt;I lost access to a SealedSecrets key once , not because I deleted it, but because I didn't know where it was stored. The cluster kept running, the apps kept deploying, but the moment I tried to rotate the key or redeploy a sealed secret, I hit a wall. The controller couldn't decrypt anything. The only way out was to find the original key, and I had to dig through old manifests and cluster logs to get it back. That’s when I learned the hard way: SealedSecrets keys aren’t magical. They’re just Kubernetes secrets, and they can be lost if you don’t back them up.&lt;/p&gt;

&lt;p&gt;The SealedSecrets controller encrypts with its public certificate and decrypts with the matching private key, and by default it generates a fresh key every 30 days while keeping the old ones, since each sealed secret can only be decrypted by the key that sealed it. If those keys are lost, all your sealed secrets become unusable. You can’t just regenerate them; the encryption is tied to those specific keys. They’re stored as Kubernetes secrets in the &lt;code&gt;sealed-secrets&lt;/code&gt; namespace (or &lt;code&gt;kube-system&lt;/code&gt;, depending on how you installed the controller). If you don’t back them up and they get deleted or corrupted, you're out of luck.&lt;/p&gt;

&lt;p&gt;Here’s the command I use to back them up. It exports every sealing key to a YAML file, which I store off-cluster in a secure backup system; these are private keys, so they don’t belong in a regular Git repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secret sealed-secrets-key &lt;span class="nt"&gt;-n&lt;/span&gt; sealed-secrets &lt;span class="nt"&gt;-o&lt;/span&gt; yaml &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; sealed-secrets-key-backup.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the only way to ensure you can recover from a key loss. If you're using GitOps tools like ArgoCD, keep this backup in your disaster-recovery store rather than in the repo itself; the whole point of SealedSecrets is that the repo only ever holds encrypted material. Redeploying the controller alone won’t delete the key secret, but rebuilding the namespace or the cluster will, and at that point the backup is the only copy.&lt;/p&gt;

&lt;p&gt;If you lose the key, the only way to recover is to restore it from a backup. You can do that by applying the YAML file back into the cluster. Just make sure the namespace and secret name match the original. If you're using ArgoCD, you may need to disable the sealed-secrets app, apply the key, and then re-enable it to avoid reconciliation conflicts.&lt;/p&gt;
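&lt;p&gt;The restore itself looks roughly like this. The controller only loads keys at startup, so it needs a restart after the secret is applied; the pod label here assumes a default Helm install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Re-apply the backed-up sealing key into the original namespace
kubectl apply -f sealed-secrets-key-backup.yaml

# Restart the controller so it picks up the restored key
kubectl delete pod -n sealed-secrets -l app.kubernetes.io/name=sealed-secrets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
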

&lt;p&gt;Don’t assume the key is safe just because it's in the cluster. Back it up, version it, and keep it somewhere you can get to when you need it. That’s the only way to stay ahead of a potential outage.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>sealedsecrets</category>
      <category>encryption</category>
      <category>keymanagement</category>
    </item>
    <item>
      <title>Ollama on Kubernetes: Recreate Strategy and Single-GPU Deadlock</title>
      <dc:creator>Guatu</dc:creator>
      <pubDate>Tue, 21 Apr 2026 20:16:02 +0000</pubDate>
      <link>https://dev.to/futhgar/ollama-on-kubernetes-recreate-strategy-and-single-gpu-deadlock-g8e</link>
      <guid>https://dev.to/futhgar/ollama-on-kubernetes-recreate-strategy-and-single-gpu-deadlock-g8e</guid>
      <description>&lt;p&gt;I deployed Ollama on Kubernetes, and the GPU worker node locked up mid-rollout. No logs, no error, just a dead pod that wouldn’t terminate and a new one that wouldn’t schedule. It wasn’t a crash. It wasn’t a timeout. It was a deadlock I’d never seen before.&lt;/p&gt;

&lt;p&gt;I expected a smooth rollout. Ollama is a single-container, single-GPU workload. I set up a Deployment with a single replica, used a PersistentVolumeClaim for model storage, and assumed Kubernetes would manage the rest. That’s what the documentation says.&lt;/p&gt;

&lt;p&gt;What actually happened was a scheduling deadlock. The old pod was still running, using the GPU, but the new pod couldn’t schedule because the GPU was in use. Kubernetes’ default RollingUpdate strategy tried to keep one pod running while replacing the other, but the GPU couldn’t be shared. The new pod waited for the old one to release the GPU, and the Deployment controller waited for the new pod to become Ready before terminating the old one. Deadlock.&lt;/p&gt;

&lt;p&gt;The fix was switching the Deployment strategy from RollingUpdate to Recreate. That way, the old pod terminates before the new one starts. No GPU contention. No deadlock. It’s a simple change: just set &lt;code&gt;type: Recreate&lt;/code&gt; in the Deployment spec.&lt;/p&gt;

&lt;p&gt;Here’s what the Deployment looks like with the fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Recreate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also had to configure the NVIDIA runtime correctly. Ollama needs the &lt;code&gt;NVIDIA_VISIBLE_DEVICES=all&lt;/code&gt; environment variable set, and the host’s driver libraries must be mounted into the container. Otherwise, the container fails to initialize, and the pod stays in a CrashLoopBackOff state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NVIDIA_VISIBLE_DEVICES&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
      &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-driver&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/usr/local/nvidia&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-driver&lt;/span&gt;
      &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/usr/local/nvidia&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
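
&lt;p&gt;A note on the host-path mount above: it’s one way to do it, but on clusters running the NVIDIA device plugin, the more common way to claim the GPU exclusively is a resource limit. This is a sketch, assuming the device plugin is installed on the node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  containers:
    - name: ollama
      resources:
        limits:
          nvidia.com/gpu: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a limit like this, the scheduler itself tracks GPU allocation, which makes the contention during a rollout explicit rather than a silent hang.&lt;/p&gt;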



&lt;p&gt;Why does this matter? If you’re running any GPU workload on Kubernetes, especially one that needs exclusive access to the GPU, you need to understand the limitations of RollingUpdate. It’s not a one-size-fits-all strategy. For GPU workloads, Recreate is the safe default; otherwise you’ll hit deadlocks that leave your pods in limbo.&lt;/p&gt;
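
&lt;p&gt;For completeness: if other tooling in your cluster assumes the RollingUpdate type, setting &lt;code&gt;maxSurge: 0&lt;/code&gt; and &lt;code&gt;maxUnavailable: 1&lt;/code&gt; also forces the controller to delete the old pod before creating the new one, which avoids the same GPU contention for a single replica. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;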

&lt;p&gt;Another gotcha I ran into was PVC sizing. Ollama models can be large; some of the bigger ones need over 100Gi of storage. I initially set the PVC to 50Gi, and the pod wouldn’t schedule: the PVC couldn’t be bound because the node didn’t have enough storage capacity. I had to bump the PVC size and make sure the underlying storage class (Longhorn in my case) had enough available space.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama-pvc&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re running Ollama on Kubernetes, be mindful of these details. A single misconfigured PVC or deployment strategy can bring the whole rollout to a halt; I’ve seen it happen more than once. It’s not just about getting the container to start, it’s about making sure it stays up and doesn’t lock the system in the process.&lt;/p&gt;

&lt;p&gt;If you’re building AI agent workloads or running large models in production, this is a common pitfall. The documentation doesn’t always highlight the GPU-specific constraints of Kubernetes. But when you’re working with real hardware, those constraints are real. And when they bite, you’ll wish you’d read about them before it’s too late.&lt;/p&gt;

&lt;p&gt;For more on GPU workloads and Kubernetes, check out &lt;a href="https://guatulabs.dev/posts/nvidia-container-toolkit-why-the-default-runtime-matters" rel="noopener noreferrer"&gt;NVIDIA Container Toolkit: Why the Default Runtime Matters&lt;/a&gt;. If you’re using Longhorn for storage, &lt;a href="https://guatulabs.dev/posts/kubernetes-storage-on-bare-metal-longhorn-in-practice" rel="noopener noreferrer"&gt;Kubernetes Storage on Bare Metal: Longhorn in Practice&lt;/a&gt; is a good next step.&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>kubernetes</category>
      <category>gpudeadlock</category>
      <category>recreatestrategy</category>
    </item>
  </channel>
</rss>
