A private mesh VPN puts all your dev infrastructure on the same network. VPS servers, your Mac, your phone. Everything talks through encrypted tunnels instead of the open internet. You stop exposing SSH ports, internal APIs, database connections to the outside world. Less surface, less risk, simpler firewall rules.
Tailscale does this well. The tunnel is P2P, the setup takes 30 seconds, and you forget it's there. Except the coordination server, the thing that lets your machines find each other, lives on Tailscale's infrastructure. And when you run multi-VPS production workloads, "on their infrastructure" is a dependency you didn't sign up for.
I got burned once before by a vendor changing the rules mid-game on my production setup. Once is enough. So I migrated to NetBird, fully self-hosted, behind the Traefik reverse proxy already running on my VPS. The mesh itself took 2 minutes per client. The real work was everything else.
TLDR: NetBird replaces Tailscale with a full self-hosted WireGuard P2P mesh. The network is up in 2 minutes. But if you already have Traefik in production, the official setup script is unusable. This article gives you the custom docker-compose, the path-based routing with priorities, and the 9 gotchas that neither the docs nor any tutorial mention.
Tailscale Works. That's Not the Point.
Tailscale is genuinely good software. I ran it for months. P2P connections via WireGuard, NAT traversal that actually works, a clean admin panel. No complaints on the tunnel side.
The problem is structural. The coordination server (Tailscale calls it their "control plane") is a SaaS you don't own. Every time your machines need to discover each other, they phone home. If Tailscale changes their pricing, kills the free tier, or has a bad day, your private network can't establish new connections. Existing tunnels survive (they're WireGuard underneath), but no new peers, no re-keying, no changes.
For a weekend homelab project, that's fine. For production VPS infrastructure where I run client services, it's a bet I don't want to keep making. Not paranoia. Just the same reflex any dev gets after watching a vendor pull the rug.
The fix was obvious. Self-host the coordination layer.
I Looked at Every Alternative. Most Are Dead Ends.
The shortlist kills itself fast:
Headscale reverse-engineers the Tailscale protocol. That means you still depend on Tailscale's client apps, and Tailscale has no obligation to keep their protocol stable for a third-party server. The community has documented this risk extensively. Your self-hosted coordination server could break on any Tailscale client update you don't control.
ZeroTier uses a custom protocol, not WireGuard. That's a dealbreaker for anyone who wants standard, audited crypto underneath.
Nebula (Slack's mesh tool) has no native iOS client. My phone is a peer. Non-negotiable.
Netmaker has documented instability issues across community reports. I don't want to debug my networking layer.
Raw WireGuard is rock solid but has no automatic mesh discovery. You're manually managing peer configs on every node. With 3 peers that's annoying. With 10 it's a job. With 30 it's a career change you didn't apply for.
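For scale, this is the bookkeeping raw WireGuard demands: one [Peer] block per remote node, maintained by hand in every node's config. Keys, addresses, and the interface name below are placeholders, not values from my setup:

```ini
# /etc/wireguard/wg0.conf — with N peers, each node carries N-1 of these
# [Peer] blocks, and every new node means touching every existing config.
[Interface]
PrivateKey = <this-node-private-key>
Address = 10.10.0.1/24
ListenPort = 51820

[Peer]
PublicKey = <peer-b-public-key>
AllowedIPs = 10.10.0.2/32
Endpoint = peer-b.example.com:51820
PersistentKeepalive = 25
```

A mesh coordinator exists precisely to generate and distribute these blocks for you.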
NetBird is the one that survived elimination. Full open-source (client AND coordination server, not just one side). Native iOS app on the App Store. WireGuard under the hood with Rosenpass post-quantum layer on top. Local user management since v0.62 (no external identity provider needed for small setups). Combined Docker image that bundles management, signal, relay, and STUN in one container.
And yes, NetBird is younger than Tailscale. The third-party integration ecosystem is thinner. You won't find the same plug-and-play connectors for every SaaS tool. But I'm not looking for an ecosystem. I'm looking for a mesh I own.
The Official Script Is Built for People with Nothing in Production.
I went to the NetBird docs. Clicked "Self-hosting." Downloaded getting-started.sh.
Then I saw that the script deploys its own Traefik instance.
I already have Traefik v3 running on this VPS. It handles routing for client sites, apps, dashboards. It has Let's Encrypt certs, middleware chains, the whole thing. I'm not tearing that down and I'm not running two Traefik instances on the same machine fighting over port 443.
So the official script went in the trash, and I wrote a custom docker-compose that plugs NetBird's services into the existing Traefik. And that's where the real work starts.
The routing is the tricky part. Everything lives on a single domain (netbird.yourdomain.com), separated by path and priority. The logic:
Priority 100 (highest): gRPC services (signal and management API). These need the h2c scheme because Traefik terminates TLS and forwards plain HTTP/2 to the containers. If you use http instead of h2c, gRPC silently fails and the client stays stuck on "Disconnected."
Priority 50: WebSocket connections for the relay service, REST API endpoints, and OAuth2 callback routes. Standard HTTP, nothing exotic.
Priority 1 (catch-all): The dashboard UI. Everything that doesn't match a higher-priority rule lands here.
Coturn (TURN/STUN server): Runs in network_mode: host, completely outside Traefik. It needs raw UDP on ports 3478 and 49152-65535. No reverse proxy can help you here.
If you've done Traefik routing before, this is familiar territory. If you haven't, each of those priority levels took me a round of "why is this 502-ing" to get right.
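A sketch of what that wiring looks like in compose labels. Service names, image tags, internal ports, and the exact path prefixes are illustrative (check them against your NetBird version), but the priority and scheme logic is the one described above:

```yaml
# Sketch — assumes an existing external "traefik" network and a Traefik v3
# entrypoint already terminating TLS on 443.
services:
  netbird-server:
    image: netbirdio/netbird:latest        # combined image (tag illustrative)
    networks: [traefik]
    labels:
      - "traefik.enable=true"
      # Priority 100: gRPC. h2c is mandatory — Traefik terminates TLS and
      # must forward plain HTTP/2, or the client stays "Disconnected".
      - "traefik.http.routers.nb-grpc.rule=Host(`netbird.yourdomain.com`) && (PathPrefix(`/signalexchange.SignalExchange`) || PathPrefix(`/management.ManagementService`))"
      - "traefik.http.routers.nb-grpc.priority=100"
      - "traefik.http.routers.nb-grpc.service=nb-grpc"
      - "traefik.http.services.nb-grpc.loadbalancer.server.scheme=h2c"
      # Priority 50: relay WebSocket, REST API, OAuth2 callback. Plain HTTP.
      - "traefik.http.routers.nb-http.rule=Host(`netbird.yourdomain.com`) && (PathPrefix(`/relay`) || PathPrefix(`/api`) || PathPrefix(`/oauth2`))"
      - "traefik.http.routers.nb-http.priority=50"
      - "traefik.http.routers.nb-http.service=nb-http"
  netbird-dashboard:
    image: netbirdio/dashboard:latest
    networks: [traefik]
    labels:
      - "traefik.enable=true"
      # Priority 1: catch-all for the dashboard UI.
      - "traefik.http.routers.nb-ui.rule=Host(`netbird.yourdomain.com`)"
      - "traefik.http.routers.nb-ui.priority=1"
  netbird-coturn:
    image: coturn/coturn:latest
    network_mode: host   # raw UDP 3478 + 49152-65535; no reverse proxy involved
networks:
  traefik:
    external: true
```

Note the double quotes around every label: the backtick-delimited matchers need them (more on that below, the hard way).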
The 9 Gotchas Nobody Warned Me About
I aged visibly during the Dex dashboard setup. Four iterations on the login screen before it worked. Most of my 2-hour setup was spent on this list, not on the mesh.
1. exposedAddress needs the port. In config.yaml, the exposedAddress field for the signal server must include :443. The gRPC client parser fails with "missing port in address" if you omit it. The signal service stays "Disconnected" with zero useful error messages.
2. Blank screen after login. No error. The dashboard JavaScript does window.location.origin + redirectURI. If your AUTH_REDIRECT_URI is a full URL instead of a relative path, it doubles up: https://netbird.yourdomain.com/https://netbird.yourdomain.com/nb-auth. Set it to /nb-auth, not the full domain. Good luck debugging that without reading the source.
3. AUTH_CLIENT_ID is "netbird-dashboard", not "dashboard". The Dex static client config expects netbird-dashboard. If you use dashboard, authentication fails silently. No error in the UI, no useful log line. You just can't log in.
4. Dex scopes are not Auth0 scopes. The correct scopes for Dex are openid profile email groups. If you copy-paste from Auth0 examples (and many blog posts use Auth0), you'll add api and email_verified. Those scopes don't exist in Dex. The token request fails.
5. Container starts, logs look clean, login doesn't work. The dashboard init script assumes Auth0 by default. You need AUTH_SUPPORTED_SCOPES=openid profile email groups and USE_AUTH0=false in the dashboard env vars. Without both, the script crashes silently. No error, no warning. Just a login page that rejects every attempt.
6. store.encryptionKey must be in config.yaml. Without it, the management server generates a random key on startup. Restart the container and it can't decrypt its own data. Set it once, keep it forever. (Or enjoy re-registering every peer after a reboot.)
7. Routes that are clearly defined return 404. You check the label syntax. Correct. You check the path. Correct. You check the priority. Correct. Turns out Docker Compose YAML and shell heredocs handle backticks differently. If your Traefik labels use backtick-delimited path matchers, wrap them in double quotes in the YAML. Heredocs produce valid-looking but subtly broken rules. Maddening.
8. Coturn bypasses your firewall. network_mode: host means Coturn skips Docker's network stack entirely. It also skips UFW. You don't need to open the relay port range 49152-65535 because the container already listens on the host directly. Sounds convenient until you realize your firewall rules are decorative for that container.
9. Cloudflare proxy must stay off. The orange cloud icon in Cloudflare DNS breaks everything: gRPC, WebSocket, and UDP. Set it to DNS only (grey icon). If you forget, the dashboard loads but nothing behind it works. Another silent failure.
The pattern across all nine: the failure mode is silence. No crash, no stack trace. Things just don't work, and the logs say everything is fine.
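Condensed, the config-side fixes from that list come down to a handful of lines. The key nesting below is approximate (check it against your NetBird version); the values are the ones that matter:

```yaml
# config.yaml (fragment — surrounding keys omitted)
signal:
  exposedAddress: netbird.yourdomain.com:443   # gotcha 1: the :443 is mandatory
store:
  encryptionKey: "<generate-once-and-never-change>"   # gotcha 6: pin it, or a restart orphans every peer

# netbird-dashboard environment (compose fragment)
environment:
  - AUTH_CLIENT_ID=netbird-dashboard                   # gotcha 3: not "dashboard"
  - AUTH_REDIRECT_URI=/nb-auth                         # gotcha 2: relative path, never a full URL
  - AUTH_SUPPORTED_SCOPES=openid profile email groups  # gotcha 4: Dex scopes, not Auth0's
  - USE_AUTH0=false                                    # gotcha 5: the init script assumes Auth0 otherwise
```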
2 Hours, 3 Peers, 2ms.
The final architecture: management server on a Contabo VPS (Ubuntu 24.04, 8GB RAM). One combined netbird-server container running management, signal, relay, and STUN. Separate containers for netbird-dashboard and netbird-coturn. All behind the existing Traefik instance except Coturn.
Three peers on the mesh: the Contabo VPS (Linux), a Hostinger VPS (Linux), and my MacBook (arm64). Ping between the two VPS nodes: ~2ms P2P. That's direct WireGuard, no relay hop.
First time I ran that ping and saw the number come back, I just sat there for a second. Two milliseconds across two datacenters on a network I built myself. No third-party relay, no coordination server in someone else's cloud. Just my boxes talking to each other.
I almost texted someone about it. Then I remembered normal people don't get excited about ping times.
The macOS client installs via Homebrew. Reconnects cleanly after sleep. The iOS client is a native App Store download, works on 4G/5G with NAT traversal through Coturn when direct P2P isn't possible.
Authentication runs through embedded Dex with local users. No external identity provider, no Okta, no Auth0. For a one-person (or three-person) setup, local users are all you need.
The migration from Tailscale was the simplest part. Uninstall the Tailscale client, nothing breaks on the service side (your apps don't know what VPN runs underneath). Update SSH configs to point to the new mesh IPs. Done.
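The SSH side is one line per host, something like this (alias and mesh IP are placeholders, not my actual addresses):

```ssh-config
# ~/.ssh/config — point HostName at the NetBird peer IP instead of the old Tailscale one
Host vps-contabo
    HostName 100.90.0.1   # NetBird mesh IP of the peer (placeholder)
    User deploy
```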
The same VPS already runs other self-hosted services behind the same Traefik proxy. Same pattern, same stack, same philosophy: if it's critical to your workflow, own it.
Two hours total. Ninety minutes of those on the dashboard authentication circus.
When the Management Server Dies, the Tunnels Don't.
This is the part that makes the whole migration worth it.
If my management server goes down, the existing P2P tunnels keep running. WireGuard sessions between peers that already know each other survive. The dashboard and signal server go dark, so no new peers can join and no config changes propagate. But production traffic between connected machines continues uninterrupted. That's a property of WireGuard, not a NetBird feature. But it's a property you only get when the coordination layer runs next to your other services, not on someone else's cloud.
The backup is two things: the config files (docker-compose, config.yaml, turnserver.conf) and a compressed tar of the Docker volume (store.db, idp.db, encryption keys). Both replicated to the second VPS. Recovery is seven steps: install Docker, copy the config, restore the volume, create the Traefik network, update DNS to the new IP, docker compose up -d, then reconnect each client with netbird down && netbird up. I tested the full procedure. It works.
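Sketched as commands, with my layout as an assumption — backup paths, the volume name netbird_data, and the domain are illustrative, not canonical:

```shell
# On the replacement VPS. Assumes the backup holds docker-compose.yml,
# config.yaml, turnserver.conf, plus volume.tar.gz (store.db, idp.db, keys).
curl -fsSL https://get.docker.com | sh                          # 1. install Docker
scp -r backup-host:/backups/netbird ./netbird && cd netbird     # 2. copy the config
docker volume create netbird_data                               # 3a. recreate the volume
docker run --rm -v netbird_data:/data -v "$PWD":/backup alpine \
  tar xzf /backup/volume.tar.gz -C /data                        # 3b. restore its contents
docker network create traefik                                   # 4. recreate the Traefik network
# 5. point the DNS A record for netbird.yourdomain.com at the new IP (manual step)
docker compose up -d                                            # 6. bring the stack up
# 7. on each client: netbird down && netbird up
```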
I also moved the management domain to something less obvious. Security through obscurity is not a strategy, but there's no reason to advertise your coordination endpoint either.
The industry will keep selling you "zero config" and "it just works." And they're right, it works. Until the vendor changes pricing, kills a tier, or goes down. Then you find out your private network depended on a server you never owned.
The mesh VPN is not the complicated part. It's 2 minutes and 2ms. The complicated part is wiring it into a stack that already runs. But once it's done, it's yours. Management server dies, tunnels survive. You want to migrate, you migrate.
Owning your infra means being both free and secure. 🤷
Sources
(*) The cover is AI-generated. The WireGuard tunnels, however, are very much real and routing packets as we speak.
This article contains affiliate links. I may earn a small commission if you purchase through them.