DEV Community

Rob

Posted on • Originally published at vibescoder.dev

Invisible Failures: The Bugs That Hide in Plain Sight

Lucky you — bonus fix content, and you don't even have to wait until Friday.

I had a work trip to Austin coming up. The homelab was humming along at home, but I realized something uncomfortable: if anything went sideways while I was gone, I had no reliable way to fix it. SSH works when everything is running. SSH doesn't help when you need to see a stuck GUI dialog, a frozen window manager, or a service that needs a browser to configure.

So before I left, I fixed the access problem. Then, from a hotel room in Austin, I found and fixed three bugs that had been silently breaking things for days. None of them crashed. None of them logged errors. They just quietly did the wrong thing until the consequences finally became visible.

1. Setting Up Remote Access Before the Trip

I'd been putting off remote desktop because "I can always walk over to it." A trip to Austin fixed that mindset.

Evaluating options: Compared five tools. xrdp is a common recommendation but it's wrong for this use case — it spawns a new desktop session instead of mirroring the existing one. If something is stuck on the real display, xrdp can't help you see it. VNC works but it's laggy and unencrypted by default. Chrome Remote Desktop depends on Google's servers and a running Chrome instance. NoMachine is great but closed source.

RustDesk won: Open source, self-hostable, mirrors the real desktop, has iOS and macOS clients.

Self-hosted server: Two Docker containers — a rendezvous server and a relay server — so connections route through my own infrastructure instead of RustDesk's public relays:

sudo docker run -d --name rustdesk-hbbs \
  --restart always \
  -p 21115:21115 -p 21116:21116 -p 21116:21116/udp -p 21118:21118 \
  -v /opt/rustdesk-server:/root \
  rustdesk/rustdesk-server hbbs

sudo docker run -d --name rustdesk-hbbr \
  --restart always \
  -p 21117:21117 -p 21119:21119 \
  -v /opt/rustdesk-server:/root \
  rustdesk/rustdesk-server hbbr

Networking via Tailscale: Cloudflare Tunnels already handle coder.vibescoder.dev, but they only proxy TCP/HTTP — RustDesk's rendezvous server requires UDP on port 21116. Tailscale is a WireGuard mesh VPN that handles TCP and UDP natively. Each tool has its lane:

  • Cloudflare Tunnel → HTTP/S services (Coder dashboard, blog)
  • Tailscale → everything else (SSH, RustDesk, any UDP/TCP service)

Clients on macOS and iOS point at the workstation's Tailscale IP. A permanent password means truly unattended access: no need to walk over and click "Accept" on a popup, which would defeat the entire purpose of remote desktop for recovery scenarios.

Everything auto-starts on reboot: Docker containers with --restart always, RustDesk and Tailscale as systemd services. Five components confirmed persistent across power cycles.

The architecture now looks like this:

┌───────────────────────────────────────────────────────┐
│              HOMELAB (Ubuntu + RTX 5090)              │
│                                                       │
│  Coder Server ──── Cloudflare Tunnel ──── Internet    │
│  (systemd, :3000)   (TCP/HTTP only)                   │
│                                                       │
│  RustDesk Client ── Tailscale Mesh ──── MacBook       │
│  (systemd)           (TCP + UDP)        iPhone        │
│                                                       │
│  RustDesk Server (Docker, --restart always)           │
│  ├── hbbs (rendezvous, :21115-21116, :21118)          │
│  └── hbbr (relay, :21117, :21119)                     │
│                                                       │
│  llama-server (Gemma 4, :8080)                        │
│  Tailscale (systemd)                                  │
│  Cloudflared (systemd, tunnel)                        │
└───────────────────────────────────────────────────────┘

Network interfaces on the workstation — Tailscale mesh, Docker bridge, and host networking all coexisting

With that in place, I headed to Austin.

2. The Deploy That Only Fails on New Content

Three out of four Vercel deploys crashed with the same error:

Can't load image https://vibescoder.dev/images/downtime-is-a-feature/vercel-dns-ipv4-error.png: fetch failed
Error: Image size cannot be determined.
Export encountered an error on /posts/[slug]/opengraph-image/route

The blog generates dynamic OpenGraph cards for social sharing — each post gets a unique 1200×630 image with the title, description, and a faded background pulled from the post's first image. The OG image route extracts the first ![alt](/images/...) reference from the markdown and renders it.
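That extraction step can be sketched in shell — a hedged stand-in for what the route does in JavaScript, using a hypothetical post body and path:

```shell
# Pull the first ![alt](/images/...) reference out of a post body
# (sample content and path are made up for illustration)
post='![Vercel error](/images/example-post/vercel-error.png)

More prose and images follow...'

firstImage=$(printf '%s\n' "$post" \
  | grep -oE '!\[[^]]*\]\(/images/[^)]+\)' | head -n1 \
  | sed -E 's/^!\[[^]]*\]\((\/images\/[^)]+)\)$/\1/')
echo "first image: $firstImage"
```

The route then renders that path into the 1200×630 card background.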

The problem was how it loaded that image:

<img
  src={`https://vibescoder.dev${firstImage}`}
  style={{ width: "100%", height: "100%", objectFit: "cover", opacity: 0.15 }}
/>

During next build, this fetches from the live production site. For a new post, those images don't exist on production yet — they're only in the current build's public/ directory, copied there by the prebuild script. The fetch fails, next/og can't determine dimensions, and the entire build crashes.

The one post that succeeded? Its first image already existed on production from a previous deploy.

This is a classic chicken-and-egg bug. It only affects new content with new images. If you redeploy the same content twice, it works — because the first deploy put the images on the live site. You could publish for weeks without hitting it, then get three failures in a row when you finally add a post with a fresh screenshot.

The fix: Read from the local filesystem instead of fetching from production. The images are already in public/images/ at build time, so we read from disk and encode as a base64 data URI:

import fs from "node:fs";
import path from "node:path";

const imgPath = path.join(process.cwd(), "public", rawImage);
const buf = fs.readFileSync(imgPath);
const ext = path.extname(rawImage).replace(".", "").toLowerCase();
const mime = ext === "jpg" || ext === "jpeg" ? "image/jpeg" : `image/${ext}`;
firstImage = `data:${mime};base64,${buf.toString("base64")}`;

A try/catch around the read means a missing image degrades gracefully — no background in the OG card instead of a build-killing crash.

3. The Shell Guard That Eats Your Auth

While pushing the OG image fix, git push failed with an auth error. This had happened before. Every time, we'd manually run coder external-auth access-token github, paste the token into the remote URL, and move on. This time we decided to actually trace it.

Three layers of GitHub auth were configured in the workspace startup script, and all three were broken in agent sessions:

| Component | What It Did | Why It Failed |
| --- | --- | --- |
| credential.helper | Shell function reading $GITHUB_TOKEN | Env var was empty |
| GITHUB_TOKEN export | Appended to ~/.bashrc | Below the interactive guard |
| gh auth login | Ran at startup | Token persisted, but gh also checks GH_TOKEN env |

The startup script appended export GITHUB_TOKEN=... to ~/.bashrc. That line landed at line 118 — after the interactive guard at line 8:

case $- in *i*) ;; *) return;; esac

Every non-interactive shell — which is what Coder agent execute() calls use — bailed out at line 8 and never reached the exports. The credential helper then read an empty $GITHUB_TOKEN and returned an empty password. Git got a 401. The agent worked around it. Nobody noticed.
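The failure mode is reproducible anywhere, no Coder required — a minimal sketch with a throwaway rc file (the token value is fake):

```shell
# Write a throwaway rc file: interactive guard first, export below it,
# the same ordering the startup script produced in ~/.bashrc
rc=$(mktemp)
cat > "$rc" <<'EOF'
case $- in *i*) ;; *) return;; esac
export GITHUB_TOKEN=ghp_fake_token
EOF

# Non-interactive shell (what agent execute() calls get): return fires first
noninteractive=$(env -u GITHUB_TOKEN bash -c ". '$rc'; echo \"\${GITHUB_TOKEN:-EMPTY}\"")
# Forced-interactive shell (-i): $- contains i, so the export is reached
interactive=$(env -u GITHUB_TOKEN bash -ic ". '$rc'; echo \"\${GITHUB_TOKEN:-EMPTY}\"" 2>/dev/null)

echo "non-interactive: $noninteractive"   # EMPTY
echo "interactive:     $interactive"      # ghp_fake_token
rm -f "$rc"
```

Same file, same export, completely different result depending on how the shell was invoked.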

The fix was three changes:

Credential helper calls coder external-auth directly — no dependency on env vars, fresh token on every git operation:

git config --global credential.helper \
  '!f() { echo "username=x-access-token"; echo "password=$(coder external-auth access-token github 2>/dev/null)"; }; f'
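The helper's shape can be sanity-checked without a live Coder deployment by stubbing the `coder` call with a shell function (the token value is fake):

```shell
# Stub the coder CLI so the helper body runs anywhere (fake token)
coder() { echo "tok_fake_123"; }

# Same function git invokes through the credential.helper config
f() {
  echo "username=x-access-token"
  echo "password=$(coder external-auth access-token github 2>/dev/null)"
}
f
```

Git appends an action argument (`get`, `store`, `erase`) when it invokes the helper; this shape ignores it and always answers with a freshly fetched token, which is exactly why every git operation gets a valid credential.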

Coder's env block injects tokens via Terraform — set in the agent process environment by Coder itself, inherited by every execute() call, no shell sourcing needed:

resource "coder_agent" "main" {
  env = {
    GITHUB_TOKEN = data.coder_external_auth.github.access_token
    GH_TOKEN     = data.coder_external_auth.github.access_token
  }
}

Cleanup of stale .bashrc entries — removed the old exports so they don't confuse future debugging.

The auth bug had been hiding for multiple sessions. The startup script looked correct. The agent silently worked around it every time. The fix was to stop relying on shell init files entirely and let Coder's process environment do the work at a level the shell can't interfere with.

Coder Agents sidebar — multiple models, multiple errors, none of them obvious until you trace the auth chain

4. The Date That Froze at Draft Creation

I published "The Agentic Gap" post. It went live. Then I noticed the date: April 26 — three days earlier. The post sorted behind three days of other content instead of appearing at the top.

The blog engine had no mechanism to update the frontmatter date when a draft is published. The flow:

  1. Draft created → Claude sets date to the day it generates the content
  2. Draft sits unpublished for N days
  3. Someone flips published: false → published: true
  4. Post appears on the blog sorted by its creation date, not its publish date

Two publish paths existed — the admin UI and the API — and neither touched the date. The admin UI did a single regex replace on the boolean. The API had a fixDateYear() function that corrects stale years (e.g., "2025" when it's 2026), but same-year drafts sailed right through.

The fix was two layers:

Client-side: handlePublish() stamps today's date immediately after flipping the boolean:

const today = new Date().toISOString().split("T")[0];
published = published.replace(
  /^date:\s*'[^']*'/m,
  `date: '${today}'`,
);

Server-side: A new stampPublishDate() function detects the false → true transition by comparing old and new content, and rewrites the date if it's a fresh publish. Safety net for all paths.

There's a third publish path — an agent directly editing the MDX and pushing to git, which is exactly what caused this bug. No code can fix that path. The fix there is process: the agent skill now instructs agents to always set the date to today when publishing.
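That process rule can also be made mechanical — a hedged sketch of a one-pass flip-and-stamp an agent could run before committing, assuming the `date: 'YYYY-MM-DD'` frontmatter quoting shown in the client-side fix (sample frontmatter is made up):

```shell
# Flip the publish flag and stamp today's date in one sed pass
today=$(date +%F)   # YYYY-MM-DD
frontmatter="---
title: 'Example Post'
date: '2026-04-26'
published: false
---"

stamped=$(printf '%s\n' "$frontmatter" \
  | sed -E "s/^published: false$/published: true/; s/^date: '[^']*'$/date: '$today'/")
printf '%s\n' "$stamped"
```

Against a real file the same expression runs in place with `sed -i -E ... post.mdx`.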

The meta moment: the agent that introduced this bug also diagnosed, fixed, and documented it. In about fifteen minutes.

5. Capacity Planning: How Many Workspaces Can This Machine Run?

With the homelab now running Coder, Gemma 4 via llama.cpp, RustDesk, Tailscale, and Cloudflare — the question became: how much headroom is left?

The inventory: Ryzen 9 9950X3D, 64 GB RAM, RTX 5090 32 GB VRAM. Profiled from inside a container via /proc and from the host via Termius on an iPhone.
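The `/proc` side of that profiling is just two fields from `/proc/meminfo` — a quick headroom read (Linux only; field values are in kB):

```shell
# MemTotal / MemAvailable from /proc/meminfo, converted to whole GiB
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "RAM total: $((total_kb / 1024 / 1024)) GiB, available: $((avail_kb / 1024 / 1024)) GiB"
```

MemAvailable counts reclaimable caches, so it's a much better headroom estimate than MemFree.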

Capacity diagnostics via Termius on iPhone — Docker stats, RAM breakdown, and service status from a hotel room

Key findings:

| Category | Memory |
| --- | --- |
| Host services (GNOME, Coder, Docker, Tailscale, Cloudflare, RustDesk) | ~5 GB |
| 7 idle workspaces | ~1.7 GB (230–270 MB each) |
| Gemma 4 (32K context, llama.cpp) | ~19 GB VRAM — zero RAM impact |
| Available for workspaces | ~58 GB |

The GPU insight: Gemma 4 runs entirely in VRAM on the RTX 5090. Zero system RAM impact. Workspaces call it via the OpenAI-compatible API on the host. The 32 GB VRAM pool is completely separate from the 64 GB system RAM — running a local LLM doesn't reduce workspace capacity at all.

Capacity estimates: 8–12 active agent sessions comfortably (CPU is the bottleneck, not RAM). Dozens of idle workspaces parked at ~250 MB each.

Resource guardrails: Added an 8 GB per-container memory limit as a safety net — 32× normal usage, so it never constrains normal work, but a runaway npm install or memory leak gets OOM-killed cleanly instead of dragging the whole host into swap:

resource "docker_container" "workspace" {
  memory      = 8192  # MB — safety net, not a throttle
  memory_swap = 8192  # equal to memory = no swap for container
}
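Whether the limit actually landed can be checked from inside a workspace by reading the cgroup ceiling — a sketch assuming the cgroup v2 layout modern Docker uses:

```shell
# cgroup v2 exposes the container's memory ceiling as a single file:
# "max" means unlimited, otherwise it's the limit in bytes (8 GiB -> 8589934592)
if [ -f /sys/fs/cgroup/memory.max ]; then
  limit=$(cat /sys/fs/cgroup/memory.max)
else
  limit="no cgroup v2 memory controller here"
fi
echo "memory ceiling: $limit"
```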

Autostop: 2-hour default TTL with 1-hour activity bump. Workspaces auto-stop when forgotten. All configurable via coder templates edit — template metadata, not Terraform.

What I Learned

Set up remote access before you need it. Every one of the fixes above happened from a hotel room because I'd set up RustDesk and Tailscale the day before I left. If I'd waited, I'd have come home to three days of broken deploys and a post sorted in the wrong place.

Invisible bugs are the most expensive. The deploy bug only hit new content. The auth bug was silently worked around. The date bug was three days stale before anyone noticed.

Shell init files are a liability. Anything that depends on .bashrc or .profile is fragile by default. Non-interactive shells, cron jobs, agent tool calls — none of them source your profile. If auth or config needs to be available everywhere, put it in the process environment or in a tool that's always in $PATH.

Self-hosted doesn't mean unmanaged. Adding RustDesk, Tailscale, and resource guardrails isn't gold-plating — it's the difference between a homelab that works when you're sitting in front of it and one that works when you're debugging from a phone in another room. Recovery access is table stakes.

The agent finds its own bugs. The publish-date bug was introduced by an agent, discovered by a human, then diagnosed, fixed, and documented by the same agent. That loop — agent ships, human reviews, agent fixes — is becoming the default workflow.

By the Numbers

  • 1 trip to Austin that forced the remote access setup
  • 4 invisible bugs found and fixed — three of them remotely
  • 3 failed Vercel deploys from the OG image chicken-and-egg bug
  • 3 layers of broken GitHub auth (credential helper, env var, gh CLI)
  • 3 days a post was live with the wrong date before anyone noticed
  • ~58 GB RAM available for workspaces after all services running
  • 8–12 concurrent active agent sessions the workstation can handle
  • 8 GB per-container memory limit as a safety net
  • 5 tools evaluated for remote desktop — RustDesk won
  • 2 Docker containers for self-hosted RustDesk server
  • 5 services confirmed persistent across reboots
  • 0 ports exposed to the internet (Tailscale mesh, Cloudflare tunnel)
  • ~15 minutes from "the date is wrong" to fix deployed across both repos
  • ~30 minutes from "how do I remote desktop" to working iPhone → Linux access
  • 1 agent that introduced a bug, then found and fixed it
