This is the final part of my homelab series. We've covered the WSL/Hyper-V architecture, bootstrap scripts, and GitOps with ArgoCD. Now let's talk about the networking stack - and the ztunnel certificate issue that haunted me for weeks.
Why Both Cilium AND Istio?
A fair question. Cilium is a CNI that can do service mesh things. Istio is a service mesh. Why run both?
Cilium handles L3/L4:
- Pod networking and IP address management
- Network policies
- eBPF-based observability (Hubble)
- Fast packet processing
Istio handles L7:
- mTLS between services (automatic encryption)
- Request-level routing (headers, paths, retries)
- Traffic splitting for canary deployments
- Distributed tracing
You can do L7 with Cilium (via Envoy), and you can do basic networking with Istio. But in my experience, letting each tool do what it does best gives the cleanest result.
Also: I wanted to learn Istio's ambient mode. Running it alongside Cilium gave me that opportunity without replacing my working CNI.
Istio Ambient Mode: No Sidecars
Traditional Istio injects an Envoy sidecar into every pod. It works, but:
- Every pod needs extra CPU/memory for the sidecar
- Sidecar injection can cause deployment issues
- Debugging gets complicated with two containers per pod
Ambient mode takes a different approach. Instead of sidecars, it uses:
- ztunnel: A per-node DaemonSet that handles L4 mTLS
- waypoint proxies: Optional per-service L7 proxies (only where needed)
For my homelab, this means dramatically lower resource usage. Most services just need mTLS, not full L7 features, so ztunnel handles them without any sidecars.
# Enable ambient mode on a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    istio.io/dataplane-mode: ambient
That's it. Pods in that namespace automatically get mTLS via ztunnel.
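For the few services that do need L7 features (the routing, retries, and tracing from the list above), ambient adds a waypoint proxy, which is declared as a Gateway resource using the istio-waypoint class. Roughly what a service-scoped waypoint looks like (a sketch; this is the shape istioctl waypoint generate produces):

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint
  namespace: my-app
  labels:
    istio.io/waypoint-for: service
spec:
  gatewayClassName: istio-waypoint
  listeners:
    - name: mesh
      port: 15008
      protocol: HBONE

Services (or the whole namespace) then opt in with the istio.io/use-waypoint label; everything else stays on plain ztunnel mTLS.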
CNI Chaining: Making Them Play Nice
Running Cilium and Istio together requires CNI chaining. Both want to configure networking, so they need to cooperate.
The setup:
- Cilium installs its CNI config as 05-cilium.conflist
- Istio CNI installs as ZZ-istio-cni.conflist (the "ZZ" ensures it loads after Cilium)
- Istio CNI chains onto Cilium rather than replacing it
# From istiod Helm values
cni:
  enabled: true
  chained: true
  cniConfFileName: "ZZ-istio-cni.conflist"
The Istio CNI doesn't do packet routing - Cilium handles that. It just sets up the traffic redirection into ztunnel that ambient mode needs.
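On the Cilium side, the main thing is telling Cilium not to treat itself as the exclusive CNI, so it leaves the Istio conflist alone. In Helm-values terms that looks roughly like this (a sketch - check the Cilium/Istio ambient compatibility notes for your versions):

# Cilium Helm values relevant to coexisting with Istio ambient
cni:
  exclusive: false          # don't remove other CNI conf files (like ZZ-istio-cni.conflist)
socketLB:
  hostNamespaceOnly: true   # stop socket-level load balancing from bypassing Istio's traffic redirection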
The Ztunnel Certificate Nightmare
Everything worked beautifully. For about a week. Then services started failing with TLS errors.
The symptoms:
- Pods could start but couldn't communicate
- Logs showed certificate validation failures
- Restarting pods temporarily fixed it
- The problem came back
After much debugging, I found the culprit: ztunnel workload certificates were expiring.
Istio issues short-lived certificates (24 hours by default) to workloads. These should auto-renew. But in certain conditions - especially after VM suspend/resume cycles - ztunnel's certificate renewal would fail silently.
The certificates would expire, mTLS would break, and nothing could talk to anything.
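If you hit something similar, ztunnel's own logs make it fairly obvious. Nothing exotic - roughly what I was running while narrowing it down:

# Look for certificate / TLS errors from the ztunnel DaemonSet
kubectl logs -n istio-system ds/ztunnel --since=1h | grep -iE 'cert|expir|tls'

# A manual rolling restart clears the bad state straight away
kubectl rollout restart daemonset/ztunnel -n istio-system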
The Workaround: Weekly Restart
I couldn't find a proper fix. The issue seems related to how ztunnel handles time jumps and certificate state after VM hibernation. Even with chrony fixing the clock, ztunnel's internal state was sometimes corrupt.
My solution is crude but effective: a CronJob that restarts ztunnel weekly.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ztunnel-restart
  namespace: istio-system
spec:
  schedule: "0 3 * * 0" # Sunday at 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  kubectl rollout restart daemonset/ztunnel -n istio-system
                  kubectl rollout status daemonset/ztunnel -n istio-system --timeout=300s
          restartPolicy: OnFailure
          serviceAccountName: ztunnel-restart-sa
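The CronJob needs a ServiceAccount with just enough RBAC to restart the DaemonSet and watch the rollout - roughly this (names match the serviceAccountName above):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ztunnel-restart-sa
  namespace: istio-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ztunnel-restart
  namespace: istio-system
rules:
  - apiGroups: ["apps"]
    resources: ["daemonsets"]
    verbs: ["get", "list", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ztunnel-restart
  namespace: istio-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ztunnel-restart
subjects:
  - kind: ServiceAccount
    name: ztunnel-restart-sa
    namespace: istio-system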
Is this ideal? No. Does it work? Yes. The rolling restart refreshes certificates and clears any stale state. I haven't had a certificate-related outage since.
I also added alerts so I know if ztunnel is unhealthy between restarts:
# Prometheus alert rule
- alert: ZtunnelCertificateExpiringSoon
  expr: |
    istio_agent_cert_expiry_seconds{app="ztunnel"} < 3600
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Ztunnel certificate expiring soon"
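The expiry alert only fires if ztunnel is up and being scraped, so it's worth pairing with a plain availability check. A sketch, assuming kube-state-metrics is installed:

- alert: ZtunnelPodsUnavailable
  expr: |
    kube_daemonset_status_number_unavailable{daemonset="ztunnel", namespace="istio-system"} > 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "One or more ztunnel pods have been unavailable for 10 minutes"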
Gateway API: The Modern Ingress
I use Gateway API instead of traditional Ingress resources. It's the future standard and works well with Istio.
The setup:
# Gateway definition
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: istio-ingress
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: All
    - name: https
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - name: homelab-tls-cert
      allowedRoutes:
        namespaces:
          from: All
Services expose themselves with HTTPRoutes:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  parentRefs:
    - name: main-gateway
      namespace: istio-ingress
  hostnames:
    - "grafana.homelab.example.com"
  rules:
    - backendRefs:
        - name: grafana
          port: 3000
This is cleaner than Ingress annotations. Each service owns its routing configuration.
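The same resource covers the canary-style traffic splitting mentioned earlier - weights on the backendRefs, no annotations. A hypothetical example (grafana-canary is made up purely to show the shape):

rules:
  - backendRefs:
      - name: grafana
        port: 3000
        weight: 90
      - name: grafana-canary
        port: 3000
        weight: 10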
MetalLB: LoadBalancer on Bare Metal
The Gateway needs an external IP. On cloud providers, you'd get a LoadBalancer automatically. On bare metal (or a Hyper-V VM), you need MetalLB.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: homelab-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.100.200-192.168.100.220
  autoAssign: true
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: homelab-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - homelab-pool
The IP range is on the same network as the VM (192.168.100.0/24). MetalLB uses L2 advertisement (ARP) to announce these IPs. From WSL2, I can reach 192.168.100.200 (the Gateway's IP) directly.
Combined with a wildcard DNS record (*.homelab.example.com -> 192.168.100.200), every service gets its own hostname automatically.
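That wildcard record only stays correct if the Gateway keeps the same IP, so it's worth pinning it rather than relying on whatever MetalLB hands out first. Gateway API lets you request an address on the Gateway itself, which Istio's gateway controller should pass through to the generated Service - a sketch of the extra field on main-gateway:

# Added to the main-gateway spec, alongside gatewayClassName and listeners
addresses:
  - type: IPAddress
    value: 192.168.100.200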
TLS with Let's Encrypt
cert-manager handles TLS certificates automatically:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - dns01:
          route53:
            region: eu-west-2
            hostedZoneID: YOUR_ZONE_ID
I use DNS-01 challenges via Route 53. This works even though the services are private - Let's Encrypt validates DNS ownership, not HTTP reachability.
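The homelab-tls-cert secret the Gateway references is produced by a Certificate resource in the istio-ingress namespace (cert-manager's Gateway API integration can also do this via an annotation on the Gateway). A sketch for the wildcard host, with names matching the Gateway above:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: homelab-tls-cert
  namespace: istio-ingress
spec:
  secretName: homelab-tls-cert
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "*.homelab.example.com"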
The Gateway's TLS certificate auto-renews every 60 days. No manual intervention needed.
The Full Networking Stack
Here's how a request flows:
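Browser on the Windows/WSL2 side
  -> wildcard DNS: *.homelab.example.com resolves to 192.168.100.200
  -> MetalLB answers ARP for that IP and delivers the traffic to the Gateway's LoadBalancer Service
  -> the Istio Gateway terminates TLS with the Let's Encrypt certificate
  -> an HTTPRoute matches the hostname and forwards to the backend Service
  -> ztunnel secures the final hop to the pod with mTLS, while Cilium does the actual packet forwarding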
Observability
With this stack, I get observability at multiple layers:
- Cilium Hubble: L3/L4 network flows, DNS queries, dropped packets
- Istio telemetry: L7 request rates, latencies, error rates
- Grafana dashboards: Everything visualised
Hubble is particularly useful for debugging. You can see exactly which pods are talking to which services and whether traffic is being allowed or denied.
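When something is being denied, a couple of Hubble CLI queries like these (run from a machine with hubble pointed at the cluster) usually find it faster than reading policy YAML:

# Show recently dropped flows across the cluster
hubble observe --verdict DROPPED --last 100

# Follow what a specific namespace is talking to
hubble observe --namespace monitoring --follow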
Lessons Learned
1. Ambient mode is promising but young. The ztunnel certificate issue is a real pain. I'm betting on upstream fixes, but for now, the weekly restart workaround is necessary.
2. CNI chaining works, but read the docs carefully. The order matters, the config file names matter, and debugging is harder when two CNIs are involved.
3. Gateway API is worth adopting. It's cleaner than Ingress, more expressive, and becoming the standard. Start using it now.
4. Start simple, add complexity later. I could have run just Cilium without Istio. Adding the service mesh was a learning exercise. For a production homelab, consider whether you actually need L7 features.
What's Next for This Homelab
Things I want to improve:
- Multi-node cluster: Currently single-node. Adding worker nodes would let me test HA patterns and node failure scenarios.
- Better alerting: The current setup has basic alerts. I want smarter alert routing and better runbooks.
- Fix ztunnel properly: Keep watching upstream Istio for fixes to the certificate renewal issues.
- ArgoCD multi-namespace: When ApplicationSets support multi-namespace properly, reorganise the GitOps structure.
Wrapping Up
This homelab journey started because I wanted to run Kubernetes on my Windows machine. I ended up with:
- A Hyper-V VM because WSL2 networking doesn't support proper CNIs
- WSL2 mirrored networking for seamless access
- K3s with Cilium and Istio ambient mode
- Full GitOps with ArgoCD's app-of-apps pattern
- Automatic TLS with Let's Encrypt
- Comprehensive observability
Is it over-engineered for a homelab? Probably. But I've learned a ton about Kubernetes networking, service meshes, and GitOps patterns. And I have a platform where I can deploy and test anything I'm working on.
If you're considering a similar setup, I hope this series helps you avoid some of the pitfalls I hit. Good luck, and may your certificates never expire unexpectedly.
This concludes the 4-part series on building a homelab Kubernetes setup on Windows. Thanks for reading!
Originally published at https://wsl-ui.octasoft.co.uk/blog/homelab-part-4-service-mesh