Zoltan Toma

Posted on Nov 11 • Originally published at zoltantoma.com on Nov 10

When Your Quick Sunday Feature Takes All Day: WSL2 Multi-VM Networking

#vagrant #wsl2 #networking #systemd

“This Should Be Quick”

Famous last words.

After implementing basic networking support and learning about WSL2’s shared network architecture, I wanted to make it work reliably across multiple distributions. The plan: add integration tests, support more distros (Debian, Fedora, AlmaLinux, Kali, openSUSE), and document the limitations properly.

What I thought would be a quick Sunday afternoon turned into a deep dive into systemd services, DNS resolution, and why background processes in SSH are surprisingly hard.

Claude: Can confirm. We went from “let’s add a test” to “why is DNS broken on Debian” to “let’s refactor everything to use systemd services” in about 4 hours.

The Multi-Distro Challenge

Ubuntu was easy - it has netplan. But what about:

Debian - Uses systemd-networkd
Fedora/AlmaLinux/Kali - Use NetworkManager
openSUSE - Uses wicked

Each has its own network configuration system. My first instinct: support each one natively.

def write_netplan_config
  # Ubuntu with netplan
  netplan_config = <<~YAML
    network:
      version: 2
      ethernets:
        eth0:
          dhcp4: true
          addresses:
            - #{ip}/#{prefix}
  YAML

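  # NOTE: the step that writes netplan_config to a file under /etc/netplan/
  # is omitted in this excerpt; only the apply step is shown.
  @machine.communicate.sudo("netplan apply")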
end

def write_networkmanager_config
  # Fedora/AlmaLinux/Kali
  @machine.communicate.sudo(
    "nmcli connection modify 'eth0' +ipv4.addresses #{ip}/#{prefix}"
  )
  @machine.communicate.sudo("nmcli connection up 'eth0'")
end

def write_systemd_networkd_config
  # Debian
  # ... and so on
end

Seemed reasonable, right?

The DNS Problem

Debian was the first to break.

vm2: Warning: Failed to fetch http://deb.debian.org/debian/dists/trixie/InRelease
vm2: Temporary failure resolving 'deb.debian.org'

The static IP was configured, but DNS stopped working after the provision script ran. Why?

After some debugging:

vagrant ssh vm2 -c "ip a show eth0"
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
    inet 192.168.50.11/24 scope global eth0
    # Where's the WSL2 DHCP IP (172.x.x.x)?

Oh. The systemd-networkd restart command wiped out the WSL2 DHCP IP, which also provides DNS resolution and the default route. No DHCP IP = no DNS = broken apt.
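A check along these lines would have flagged the problem right after provisioning. This is a minimal sketch, not the plugin's code. `extra_addresses` is a hypothetical helper that takes the text printed by `ip -4 addr show dev eth0` and returns every IPv4 address that is not one of the configured static IPs. An empty result means the WSL2-assigned address is gone. The 172.x address in the healthy sample is made up for illustration.

```ruby
# Hypothetical helper: list IPv4 addresses on an interface that are NOT
# in the configured static set. Input is the text printed by
# `ip -4 addr show dev eth0`.
def extra_addresses(ip_output, static_ips)
  found = ip_output.scan(/\binet (\d{1,3}(?:\.\d{1,3}){3})\/\d+/).flatten
  found - static_ips
end

# Output shaped like the broken Debian VM above: only the static IP is left.
broken = <<~OUT
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
      inet 192.168.50.11/24 scope global eth0
OUT

# Illustrative healthy output; the 172.x address is an assumption.
healthy = <<~OUT
  2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
      inet 172.20.0.5/20 brd 172.20.15.255 scope global eth0
      inet 192.168.50.11/24 scope global eth0
OUT

static = ["192.168.50.11"]
puts "broken VM keeps WSL2 address?  #{!extra_addresses(broken, static).empty?}"
puts "healthy VM keeps WSL2 address? #{!extra_addresses(healthy, static).empty?}"
```

In a real test the input would come from the guest rather than from a heredoc, and the check would run after every provisioning step that touches the network.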

Claude: This is the WSL2 version of “I deleted production.” Except you’re deleting your own network stack.

Ubuntu Wasn’t Better

Tried Ubuntu with netplan:

==> vm1: Netplan configuration written with 1 static IP(s)
systemd-networkd is not running, output might be incomplete.
Failed to reload network settings: Unit dbus-org.freedesktop.network1.service not found.
Falling back to a hard restart of systemd-networkd.service

Same problem. netplan apply tries to restart systemd-networkd, which breaks WSL2’s network management.

NetworkManager: Different Problem

Fedora seemed promising - NetworkManager should handle multiple IPs gracefully, right?

Error: Connection activation failed: No suitable device found for this connection
(device eth0 not available because profile is not compatible with device
(permanent MAC address doesn't match)).

Ah. WSL2’s MAC address changes on every restart. NetworkManager stores the MAC in the connection profile and refuses to work when it doesn’t match.

Great.

The Systemd Service Revelation

At this point I had three different broken approaches for three different distros. Time to step back.

What do we actually need?

Add static IP to eth0
Keep WSL2’s DHCP IP intact (for DNS/routing)
Persist across reboots
Work on all distros

What if… we just don’t touch the native network config systems at all?

def write_systemd_static_ip_service(after_services, distro_name)
  # Universal method for ALL distros
  # Join with "; " so ExecStart stays on a single line in the unit file
  ip_commands = @static_ips.map { |ip_info|
    "ip addr add #{ip_info[:ip]}/#{ip_info[:prefix]} dev eth0 || true"
  }.join("; ")

  service_config = <<~SERVICE
    [Unit]
    Description=Vagrant Static IP Configuration
    After=#{after_services}
    Wants=network-online.target

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/bin/bash -c '#{ip_commands}'

    [Install]
    WantedBy=multi-user.target
  SERVICE

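  # Install the service file. service_path is where the unit text was
  # uploaded on the guest (upload step omitted from this excerpt).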
  @machine.communicate.sudo("mv #{service_path} /etc/systemd/system/vagrant-static-ip.service")
  @machine.communicate.sudo("systemctl daemon-reload")
  @machine.communicate.sudo("systemctl enable vagrant-static-ip.service")
  @machine.communicate.sudo("systemctl start vagrant-static-ip.service")
end

A systemd oneshot service that:

Runs after network is up
Adds static IPs with ip addr add
Uses || true so re-running it is harmless (ip addr add fails if the address is already assigned)
Doesn’t restart anything
Doesn’t touch WSL2’s DHCP configuration
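These properties are easier to check if the unit text comes from a pure function that never talks to a VM. The sketch below assumes that shape. `build_static_ip_unit` and its `after:` parameter are hypothetical names, not the plugin's API. The commands are joined with "; " so the whole ExecStart value stays on one line of the unit file.

```ruby
# Hypothetical pure function: build the unit text from a list of
# { ip:, prefix: } hashes. Nothing here talks to a VM.
def build_static_ip_unit(static_ips, after: "network-online.target")
  # "|| true" makes a repeated run harmless: `ip addr add` exits non-zero
  # when the address is already assigned.
  commands = static_ips.map do |info|
    "ip addr add #{info[:ip]}/#{info[:prefix]} dev eth0 || true"
  end

  # Join with "; " so ExecStart stays on a single line of the unit file.
  <<~UNIT
    [Unit]
    Description=Vagrant Static IP Configuration
    After=#{after}
    Wants=network-online.target

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/bin/bash -c '#{commands.join("; ")}'

    [Install]
    WantedBy=multi-user.target
  UNIT
end

unit = build_static_ip_unit([
  { ip: "192.168.50.10", prefix: 24 },
  { ip: "192.168.50.11", prefix: 24 }
])
puts unit
```

With the text generation separated out, a unit test can assert on the generated file without booting a distribution.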

And here’s the beautiful part - it works on every distro because they all use systemd.

The Refactor

Wait, why are we even detecting distros? The whole point of the systemd service is that it’s universal. And network-online.target works on all of them.

def write_netplan_config
  # Universal solution for all distros - systemd service
  # No need to detect distro or network manager
  write_systemd_static_ip_service
end

That’s it. No detection. No branching. One implementation.

From ~160 lines of distro-specific code to ~45 lines of universal code. DRY for the win.

Testing: The SSH Background Process Bug

Integration test time. Need to test VM-to-VM communication. Python HTTP server seems perfect:

# Start HTTP server on vm1
vagrant ssh vm1 -c "python3 -m http.server 8080 --bind 192.168.50.10 > /dev/null 2>&1 &"

# Test from vm2
vagrant ssh vm2 -c "curl http://192.168.50.10:8080/"

Except… it doesn’t work. The & background process never starts.

Why? Because of how we encode commands:

def encode_command(command)
  encoded = Base64.strict_encode64(command)
  "echo '#{encoded}' | base64 -d | bash"
end

That pipe to bash is blocking. Even with & at the end of the command, the SSH session waits for bash to finish. And bash waits for the backgrounded process because… pipes.
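To rule out the encoding itself as the culprit, the round trip can be checked in isolation. This sketch reuses the `encode_command` shape shown above and decodes the payload locally, the way `base64 -d` would on the VM. The `payload` extraction regex is an assumption made for the demo.

```ruby
require "base64"

# Same shape as encode_command above.
def encode_command(command)
  encoded = Base64.strict_encode64(command)
  "echo '#{encoded}' | base64 -d | bash"
end

command = "python3 -m http.server 8080 --bind 192.168.50.10 > /dev/null 2>&1 &"
wrapped = encode_command(command)

# Pull the payload back out and decode it, as `base64 -d` would on the VM.
payload = wrapped[/echo '([^']+)'/, 1]
decoded = Base64.strict_decode64(payload)

puts wrapped
puts decoded
```

The decoded text matches the original command, redirects and trailing & included, so bash receives exactly what was typed. That points at the pipeline and session handling rather than at the command being mangled.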

Tried eval instead:

"eval \"$(echo '#{encoded}' | base64 -d)\""

But that breaks redirect parsing because of the double quotes.

Solution: leave it as a known bug for now and use PowerShell jobs in the test to work around it:

$serverJob = Start-Job -ScriptBlock {
    vagrant ssh vm1 -c "python3 -m http.server 8080 --bind 192.168.50.10"
}
Start-Sleep -Seconds 3
$http_result = vagrant ssh vm2 -c "curl http://192.168.50.10:8080/"
Stop-Job $serverJob

Not elegant, but it works. The SSH command encoding is a problem for another day.
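The fixed three-second sleep is the fragile part of that workaround. One alternative is to poll until the server answers. The sketch below shows the idea in Ruby rather than PowerShell, to match the plugin code. `wait_until` is a hypothetical helper, and the probe in the demo is a stand-in that fails twice before succeeding.

```ruby
# Hypothetical helper: call the block until it returns truthy or the
# timeout expires. Returns true on success, false on timeout.
def wait_until(timeout: 15, interval: 0.5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Stand-in probe for the demo: fails twice, then succeeds.
attempts = 0
ready = wait_until(timeout: 5, interval: 0.01) do
  attempts += 1
  attempts >= 3
end
puts "ready=#{ready} after #{attempts} attempts"
```

In an actual test the block would run the probe through the second VM, for example `system("vagrant", "ssh", "vm2", "-c", "curl -sf http://192.168.50.10:8080/")`, since the static IP may not be reachable from the Windows host.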

Claude: Translation: “I’ll fix this later” = “This will ship as-is”

The README: Managing Expectations

After all this, I wrote probably the most important documentation - the limitations README. Because this feature works, but it has constraints:

## WSL2 Networking Limitations

⚠️ **Important:** Private network support in WSL2 is experimental.

### Shared Network Infrastructure
- All WSL2 VMs share the same virtual network switch
- Same MAC address - every VM gets the same MAC on each WSL restart
- Shared base IP - all VMs share the same WSL2 DHCP IP
- IP visibility - you may see other VMs' static IPs on a single VM

### What This Means
**VM-to-VM Communication:** ⚠️ LIMITED
- VMs share the same physical NIC and MAC address
- Ping between VMs using static IPs does not work reliably
- TCP/UDP application traffic may work if routing is configured correctly

**Windows Host Access:** ❌ LIMITED
- Use port forwarding instead

**Process Isolation:** ✅ WORKS
- Each distribution runs in its own PID namespace

Being honest about limitations is better than users discovering them the hard way.

Lessons Learned

1. Don’t Fight the Platform

My first instinct was to use each distro’s native network config system. But WSL2 isn’t a traditional VM - it has its own quirks. Fighting those quirks with netplan/NetworkManager/etc. just created more problems.

The systemd service approach works with WSL2’s design instead of against it.

2. Sometimes “Good Enough” Is the Win

Perfect VM-to-VM networking in WSL2? Not possible. The architecture doesn’t support it.

But adding static IPs that survive reboots and work across distros? That’s achievable. And for development/testing use cases, it’s useful.

3. Documentation > Implementation

The most important code I wrote today was the README explaining why things don’t work perfectly. Setting expectations upfront saves everyone frustration later.

4. Integration Tests Reveal Real Problems

Writing the test exposed the SSH background process bug, the DNS issues, and the MAC address problem. All things that wouldn’t show up in manual testing.

Even though the test needed workarounds, it was still valuable.

What’s Next

The networking feature works across 6 distributions (Ubuntu, Debian, Fedora, AlmaLinux, Kali, openSUSE). It’s documented. It has tests (mostly).

But there are TODOs:

Fix the SSH command encoding for background processes
Maybe explore WSL2 mirrored networking mode (Windows 11 22H2+)
Test with more complex network scenarios

For now though, it’s good enough. Users can run multi-VM setups for testing, the limitations are clear, and the code is maintainable.

That’s a win.

Claude: Started the session thinking “quick feature add.” Ended it having refactored the entire networking implementation and written a philosophical README about WSL2’s limitations. Classic Sunday.

Try It

The multi-VM networking example is in the repo:

git clone https://github.com/LeeShan87/vagrant-wsl2-provider
cd vagrant-wsl2-provider/examples/multi-vm-network
vagrant up

Check the README for the full list of limitations and workarounds. And maybe don’t expect VirtualBox-level networking - this is WSL2, after all.

Actual time spent: 4+ hours
Lines of code written: ~300
Lines of code deleted: ~100
Times I questioned my life choices: Several
Would I do it again: Probably
