Ian Packard for Octasoft Ltd


Service Mesh Adventures - Cilium, Istio Ambient, and the Ztunnel Saga

This is the final part of my homelab series. We've covered the WSL/Hyper-V architecture, bootstrap scripts, and GitOps with ArgoCD. Now let's talk about the networking stack - and the ztunnel certificate issue that haunted me for weeks.

Why Both Cilium AND Istio?

A fair question. Cilium is a CNI that can do service mesh things. Istio is a service mesh. Why run both?

Cilium handles L3/L4:

  • Pod networking and IP address management
  • Network policies
  • eBPF-based observability (Hubble)
  • Fast packet processing

Istio handles L7:

  • mTLS between services (automatic encryption)
  • Request-level routing (headers, paths, retries)
  • Traffic splitting for canary deployments
  • Distributed tracing

You can do L7 with Cilium (via Envoy), and you can do basic networking with Istio. But in my experience, letting each tool do what it does best gives the cleanest result.

Also: I wanted to learn Istio's ambient mode. Running it alongside Cilium gave me that opportunity without replacing my working CNI.

Istio Ambient Mode: No Sidecars

Traditional Istio injects an Envoy sidecar into every pod. It works, but:

  • Every pod needs extra CPU/memory for the sidecar
  • Sidecar injection can cause deployment issues
  • Debugging gets complicated with two containers per pod

Ambient mode takes a different approach. Instead of sidecars, it uses:

  • ztunnel: A per-node DaemonSet that handles L4 mTLS
  • waypoint proxies: Optional per-service L7 proxies (only where needed)

For my homelab, this means dramatically lower resource usage. Most services just need mTLS, not full L7 features, so ztunnel handles them without any sidecars.

# Enable ambient mode on a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    istio.io/dataplane-mode: ambient

That's it. Pods in that namespace automatically get mTLS via ztunnel.
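
To sanity-check that a namespace is actually enrolled, you can check the label and skim ztunnel's logs. A rough sketch, using the my-app example namespace from above:

# Confirm the ambient label is present
kubectl get namespace my-app --show-labels

# Watch ztunnel pick up workloads in that namespace
# (adjust the label selector if your install names the pods differently)
kubectl logs -n istio-system -l app=ztunnel --tail=100 | grep my-app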

CNI Chaining: Making Them Play Nice

Running Cilium and Istio together requires CNI chaining. Both want to configure networking, so they need to cooperate.

[Diagram: CNI chaining between Cilium and Istio]

The setup:

  1. Cilium installs its CNI config as 05-cilium.conflist
  2. Istio CNI installs as ZZ-istio-cni.conflist (the "ZZ" ensures it loads after Cilium)
  3. Istio CNI chains onto Cilium rather than replacing it
# From istiod Helm values
cni:
  enabled: true
  chained: true
  cniConfFileName: "ZZ-istio-cni.conflist"

The Istio CNI doesn't do packet routing - Cilium handles that. It just sets up the traffic redirection that ambient mode needs, so workload traffic can be intercepted by ztunnel.
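
Conceptually, the end state is a single plugin chain where cilium-cni runs first and istio-cni is appended after it. A simplified sketch of what that chain looks like (the real conflist files generated by Cilium and the Istio CNI contain more plugin-specific fields):

{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    { "type": "cilium-cni" },
    { "type": "istio-cni" }
  ]
}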

The Ztunnel Certificate Nightmare

Everything worked beautifully. For about a week. Then services started failing with TLS errors.

The symptoms:

  • Pods could start but couldn't communicate
  • Logs showed certificate validation failures
  • Restarting pods temporarily fixed it
  • The problem came back

After much debugging, I found the culprit: ztunnel workload certificates were expiring.

Istio issues short-lived certificates (24 hours by default) to workloads. These should auto-renew. But in certain conditions - especially after VM suspend/resume cycles - ztunnel's certificate renewal would fail silently.

The certificates would expire, mTLS would break, and nothing could talk to anything.
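
If you hit something similar, looking at certificate expiry directly helps. Recent istioctl releases can dump ztunnel's certificate state; the exact subcommand has moved around (it sat under istioctl experimental for a while), so treat this as a sketch and check your version's help output:

# List the workload certificates a ztunnel pod holds, including expiry times
istioctl ztunnel-config certificates <ztunnel-pod-name>.istio-system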

[Screenshot: ztunnel alert firing for certificate issues]

The Workaround: Weekly Restart

I couldn't find a proper fix. The issue seems related to how ztunnel handles time jumps and certificate state after VM hibernation. Even with chrony fixing the clock, ztunnel's internal state was sometimes corrupt.

My solution is crude but effective: a CronJob that restarts ztunnel weekly.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ztunnel-restart
  namespace: istio-system
spec:
  schedule: "0 3 * * 0"  # Sunday at 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  kubectl rollout restart daemonset/ztunnel -n istio-system
                  kubectl rollout status daemonset/ztunnel -n istio-system --timeout=300s
          restartPolicy: OnFailure
          serviceAccountName: ztunnel-restart-sa
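
The ztunnel-restart-sa service account referenced by the CronJob needs permission to patch and watch the DaemonSet. A minimal RBAC sketch covering what kubectl rollout restart and rollout status require:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ztunnel-restart-sa
  namespace: istio-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ztunnel-restart
  namespace: istio-system
rules:
  - apiGroups: ["apps"]
    resources: ["daemonsets"]
    verbs: ["get", "list", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ztunnel-restart
  namespace: istio-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ztunnel-restart
subjects:
  - kind: ServiceAccount
    name: ztunnel-restart-sa
    namespace: istio-system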

Is this ideal? No. Does it work? Yes. The rolling restart refreshes certificates and clears any stale state. I haven't had a certificate-related outage since.

I also added alerts so I know if ztunnel is unhealthy between restarts:

# Prometheus alert rule
- alert: ZtunnelCertificateExpiringSoon
  expr: |
    istio_agent_cert_expiry_seconds{app="ztunnel"} < 3600
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Ztunnel certificate expiring soon"

Gateway API: The Modern Ingress

I use Gateway API instead of traditional Ingress resources. It's the future standard and works well with Istio.

The setup:

# Gateway definition
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: istio-ingress
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: All
    - name: https
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - name: homelab-tls-cert
      allowedRoutes:
        namespaces:
          from: All

Services expose themselves with HTTPRoutes:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  parentRefs:
    - name: main-gateway
      namespace: istio-ingress
  hostnames:
    - "grafana.homelab.example.com"
  rules:
    - backendRefs:
        - name: grafana
          port: 3000

This is cleaner than Ingress annotations. Each service owns its routing configuration.
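
It also makes the traffic splitting mentioned earlier straightforward: weighted backendRefs in a rule give you a canary split without extra CRDs. A hypothetical example (the service names and port are made up for illustration):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: my-app
spec:
  parentRefs:
    - name: main-gateway
      namespace: istio-ingress
  hostnames:
    - "my-app.homelab.example.com"
  rules:
    - backendRefs:
        # 90/10 split between the stable and canary Services
        - name: my-app
          port: 8080
          weight: 90
        - name: my-app-canary
          port: 8080
          weight: 10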

MetalLB: LoadBalancer on Bare Metal

The Gateway needs an external IP. On cloud providers, you'd get a LoadBalancer automatically. On bare metal (or a Hyper-V VM), you need MetalLB.

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: homelab-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.100.200-192.168.100.220
  autoAssign: true

---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: homelab-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - homelab-pool

The IP range is on the same network as the VM (192.168.100.0/24). MetalLB uses L2 advertisement (ARP) to announce these IPs. From WSL2, I can reach 192.168.100.200 (the Gateway's IP) directly.
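
To confirm MetalLB handed out an address, check the Gateway and the LoadBalancer Service Istio provisions for it. A quick sketch (the generated Service name depends on your Gateway deployment):

# The Gateway's ADDRESS and the generated Service's EXTERNAL-IP
# should both come from the MetalLB pool
kubectl get gateways.gateway.networking.k8s.io -n istio-ingress
kubectl get svc -n istio-ingress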

Combined with a wildcard DNS record (*.homelab.example.com -> 192.168.100.200), every service gets its own hostname automatically.

TLS with Let's Encrypt

cert-manager handles TLS certificates automatically:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - dns01:
          route53:
            region: eu-west-2
            hostedZoneID: YOUR_ZONE_ID

I use DNS-01 challenges via Route 53. This works even though the services are private - Let's Encrypt validates DNS ownership, not HTTP reachability.
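
The homelab-tls-cert secret the Gateway references is produced by a cert-manager Certificate. A minimal sketch, assuming a wildcard certificate for the example domain used above:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: homelab-tls-cert
  namespace: istio-ingress
spec:
  secretName: homelab-tls-cert
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "*.homelab.example.com"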

The Gateway's TLS certificate auto-renews every 60 days. No manual intervention needed.

The Full Networking Stack

Here's how a request flows:

[Diagram: the full homelab networking stack]
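
In short: a request to *.homelab.example.com resolves (via the wildcard DNS record) to the MetalLB-advertised Gateway IP, TLS terminates at the Istio gateway, an HTTPRoute picks the backend, and traffic reaches the pod over ztunnel-provided mTLS, with Cilium doing the actual packet forwarding underneath.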

Observability

With this stack, I get observability at multiple layers:

  • Cilium Hubble: L3/L4 network flows, DNS queries, dropped packets
  • Istio telemetry: L7 request rates, latencies, error rates
  • Grafana dashboards: Everything visualised

Hubble is particularly useful for debugging. You can see exactly which pods are talking to which services and whether traffic is being allowed or denied.
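
As a rough sketch, watching flows from the Hubble CLI looks like this (it needs access to Hubble Relay, for example via cilium hubble port-forward; the my-app namespace is just the running example):

# Stream live flows for one namespace
hubble observe --namespace my-app --follow

# Show only traffic that Cilium dropped
hubble observe --namespace my-app --verdict DROPPED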

[Screenshot: Hubble UI showing service-to-service communication]

Lessons Learned

1. Ambient mode is promising but young. The ztunnel certificate issue is a real pain. I'm betting on upstream fixes, but for now, the weekly restart workaround is necessary.

2. CNI chaining works, but read the docs carefully. The order matters, the config file names matter, and debugging is harder when two CNIs are involved.

3. Gateway API is worth adopting. It's cleaner than Ingress, more expressive, and becoming the standard. Start using it now.

4. Start simple, add complexity later. I could have run just Cilium without Istio. Adding the service mesh was a learning exercise. For a production homelab, consider whether you actually need L7 features.

What's Next for This Homelab

Things I want to improve:

  • Multi-node cluster: Currently single-node. Adding worker nodes would let me test HA patterns and node failure scenarios.
  • Better alerting: The current setup has basic alerts. I want smarter alert routing and better runbooks.
  • Fix ztunnel properly: Keep watching upstream Istio for fixes to the certificate renewal issues.
  • ArgoCD multi-namespace: When ApplicationSets support multi-namespace properly, reorganise the GitOps structure.

Wrapping Up

This homelab journey started because I wanted to run Kubernetes on my Windows machine. I ended up with:

  • A Hyper-V VM because WSL2 networking doesn't support proper CNIs
  • WSL2 mirrored networking for seamless access
  • K3s with Cilium and Istio ambient mode
  • Full GitOps with ArgoCD's app-of-apps pattern
  • Automatic TLS with Let's Encrypt
  • Comprehensive observability

Is it over-engineered for a homelab? Probably. But I've learned a ton about Kubernetes networking, service meshes, and GitOps patterns. And I have a platform where I can deploy and test anything I'm working on.

If you're considering a similar setup, I hope this series helps you avoid some of the pitfalls I hit. Good luck, and may your certificates never expire unexpectedly.


This concludes the 4-part series on building a homelab Kubernetes setup on Windows. Thanks for reading!


Originally published at https://wsl-ui.octasoft.co.uk/blog/homelab-part-4-service-mesh
