Guatu

Posted on Jun 27 • Originally published at guatulabs.dev

MetalLB on Bare Metal: LoadBalancer Without a Cloud Provider

#metallb #kubernetes #baremetal #networking

Deploy a Service of type LoadBalancer on a bare-metal cluster and kubectl get svc hands you a permanent <pending> in the EXTERNAL-IP column. No error, no event, no hint. The Service just sits there forever, waiting for an external IP that will never arrive, because nothing in your cluster is responsible for assigning one.

That gap is the entire reason MetalLB exists. On AWS or GCP, the cloud controller manager watches for LoadBalancer Services, calls the provider API, provisions an actual load balancer, and writes the IP back into the Service status. On bare metal there is no API to call. Kubernetes ships the abstraction but not the implementation, and you're left holding a Service type that does nothing until you supply the missing piece yourself.

Who should care

If you run Kubernetes on Proxmox VMs, on a stack of mini PCs, or on anything that isn't a managed cloud, you've hit this. The moment you try to expose ingress-nginx, Traefik, a database, or an LLM inference endpoint to the rest of your LAN, you need a stable IP that lives outside the cluster and routes in. MetalLB is the most common answer, and the config has changed enough across versions that a 2021 tutorial will actively mislead you. This walks through MetalLB v0.15.x, the L2-versus-BGP decision, and the connectivity traps that make an IP look assigned while traffic quietly goes nowhere.

What I reached for first (and why it was wrong)

Before MetalLB, the obvious-looking options all have a catch.

NodePort. It works, technically. You get a port in the 30000-32767 range on every node, and you can hit http://node-ip:31234. Then you try to use it for real and remember why nobody runs production this way: the ports are ugly, they're not 80/443, and you've now hardcoded a specific node's IP into whatever's calling the service. If that node reboots or drains, your "load balancer" is a single point of failure pointing at a machine that's gone.

hostNetwork on the ingress controller. Bind ingress-nginx directly to ports 80 and 443 on the host. This also works, and it's genuinely fine for a single-node setup. On a multi-node cluster it falls apart: now you need an external load balancer or DNS round-robin in front of the nodes to spread traffic, and you've moved the problem one layer up the stack instead of solving it. You also lose the clean LoadBalancer Service abstraction that everything else in the ecosystem expects.

Assuming spec.loadBalancerIP would just work. I set spec.loadBalancerIP: 10.0.0.210 on a Service, expecting Kubernetes to honor it. Nothing happened. That field only takes effect if a controller is watching for it and acts on the request. Without MetalLB or a cloud provider, it's an inert string. The field has also been deprecated in the Kubernetes API since v1.24. MetalLB still honors it for now, but the recommended way to request an address is an annotation, so it's not a syntax to build new manifests on.

The mistake underneath all three: treating type: LoadBalancer as a feature Kubernetes provides, when it's really a contract Kubernetes defines and expects something else to fulfill.

Installing MetalLB

MetalLB has two pieces. A controller Deployment handles IP allocation, watching Services and writing the assigned address into their status. A speaker DaemonSet runs on every node and is responsible for actually announcing the IP to the network, either via ARP (L2 mode) or BGP.

I install it with Helm so it fits a GitOps flow. If you run ArgoCD app-of-apps, this slots in as one more application.

helm repo add metallb https://metallb.github.io/metallb
helm install metallb metallb/metallb \
  --namespace metallb-system \
  --create-namespace \
  --version 0.15.3

Installing the chart does nothing visible on its own. MetalLB ships with no IP pool by default, which is deliberate: handing out addresses you didn't explicitly authorize would be a great way to start an IP conflict war with your DHCP server. You have to tell it which addresses it owns.

That configuration moved to CRDs back in v0.13. If a guide tells you to edit a ConfigMap named config in the metallb-system namespace, it predates that change and the syntax no longer applies. The two objects you need are IPAddressPool and L2Advertisement.

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lan-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.0.0.200-10.0.0.250  # must sit OUTSIDE your DHCP lease range
  autoAssign: true
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lan-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - lan-pool

The pool defines which addresses MetalLB may hand out. The L2Advertisement says "announce these addresses using Layer 2." Without the advertisement object, MetalLB will assign an IP from the pool but never tell the network it exists, which produces one of the most confusing failure states: a Service with an EXTERNAL-IP that nothing on the LAN can reach. The pool without the advertisement is half a configuration.

The address range matters. In L2 mode those IPs must live in the same subnet as your nodes, and they must be outside whatever your DHCP server hands out. Pick a slice your router will never lease. A .200-.250 block at the top of a /24 is a common choice precisely because most DHCP pools stop well before there.
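
Both rules are easy to check before you apply the pool. Here's a minimal sketch in Python, assuming you know your node subnet and your router's DHCP lease range. The values are hypothetical, and check_pool is a helper invented for this post, not anything MetalLB ships.

```python
import ipaddress


def parse_range(text):
    """Parse 'a.b.c.d-e.f.g.h' into (first, last) address objects."""
    first, last = (ipaddress.ip_address(part.strip()) for part in text.split("-"))
    if first > last:
        raise ValueError(f"range is reversed: {text}")
    return first, last


def check_pool(pool, node_subnet, dhcp_range):
    """Return a list of problems; an empty list means the pool looks sane."""
    problems = []
    pool_first, pool_last = parse_range(pool)
    dhcp_first, dhcp_last = parse_range(dhcp_range)
    subnet = ipaddress.ip_network(node_subnet)

    # L2 mode: both ends of the pool must be on the nodes' subnet.
    if pool_first not in subnet or pool_last not in subnet:
        problems.append(f"pool {pool} is not fully inside {node_subnet}")

    # Two closed ranges overlap when each one starts before the other ends.
    if pool_first <= dhcp_last and dhcp_first <= pool_last:
        problems.append(f"pool {pool} overlaps DHCP range {dhcp_range}")

    return problems


if __name__ == "__main__":
    # Hypothetical LAN: a /24 whose router leases .50 through .199.
    print(check_pool("10.0.0.200-10.0.0.250", "10.0.0.0/24", "10.0.0.50-10.0.0.199"))  # -> []
    # Same LAN, but the pool now starts inside the lease range: one problem reported.
    print(check_pool("10.0.0.150-10.0.0.250", "10.0.0.0/24", "10.0.0.50-10.0.0.199"))
```

An empty list is the only result you want to see before the pool goes into Git.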

Apply both, deploy a test Service, and the <pending> finally resolves:

$ kubectl get svc -n ingress-nginx
NAME                 TYPE           EXTERNAL-IP   PORT(S)
ingress-nginx        LoadBalancer   10.0.0.200    80:31...,443:32...

L2 versus BGP: pick the simple one until you can't

MetalLB runs in one of two modes, and most homelab and small-cluster setups want Layer 2.

L2 mode uses ARP for IPv4 (NDP for IPv6). One speaker pod becomes the leader for a given Service IP and answers ARP requests for it. When a host on the LAN asks "who has 10.0.0.200?", that node replies with its own MAC, traffic arrives at the node, and kube-proxy forwards it to a backend pod. The whole thing needs zero cooperation from your router or switch. It looks like a normal host claiming an IP, which is exactly what it is.

BGP mode has each speaker peer with your router and advertise the Service IPs as routes. The router then load-balances across nodes via ECMP. This gives you real multi-node traffic distribution, and failover that can be very fast when BFD is enabled (without it, detection waits on the BGP hold timer). The price is a router that speaks BGP and a willingness to manage routing config. That's a reasonable trade in a rack with a proper top-of-rack switch. In a homelab where the "router" is a consumer box, it's usually more than you want to take on.

Aspect	L2 (ARP/NDP)	BGP
Router requirement	None	Must speak BGP
Traffic distribution	One node per IP at a time	ECMP across nodes
Failover speed	Seconds (ARP re-announce)	Sub-second with BFD, otherwise bound by the BGP hold timer
Setup complexity	Low	Moderate to high
Best for	Homelabs, small clusters	Datacenter, high throughput

The honest caveat about L2: it is not true load balancing. For any single Service IP, all traffic lands on one node, the elected leader, and that node forwards internally. You get failover (if the leader dies, another speaker takes over and re-announces via gratuitous ARP) but not bandwidth aggregation. For a homelab serving an ingress controller or an Ollama API, that's completely fine. A single 10G node will saturate most home internet connections long before it becomes the bottleneck. If you genuinely need to spread inbound traffic across nodes, that's the line where BGP earns its complexity.

Why L2 mode actually works

The mechanism is worth understanding, because it explains every weird failure you'll hit later.

When the leader speaker announces a Service IP, it doesn't add the IP to a network interface in the usual sense. It answers ARP requests at the protocol level. A client on the LAN broadcasts an ARP query, the leading node responds with its MAC address, and the client's ARP table now maps the Service IP to that node's hardware address. Traffic flows to the node, hits iptables/IPVS rules installed by kube-proxy, and gets DNAT'd to a pod endpoint.

When the leader node goes down, the other speakers notice through MetalLB's memberlist-based failure detection, a new leader is chosen from the surviving nodes, and it sends a gratuitous ARP. Gratuitous ARP is an unsolicited "hey, this IP is at my MAC now" broadcast that updates every device's ARP cache without them asking. That's the failover. It takes a few seconds for caches to update, which is why L2 failover is measured in seconds rather than milliseconds.
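
To make the one-leader-per-IP idea concrete, here's a toy model in Python. It assumes leaders are picked by ranking healthy nodes on a hash of node name plus Service IP. That ranking is an illustration I chose, not MetalLB's actual code, and the node names are hypothetical.

```python
import hashlib


def elect_leader(service_ip, healthy_nodes):
    """Pick one node per Service IP, deterministically, from the healthy set."""
    if not healthy_nodes:
        return None

    # Every speaker can compute the same ranking locally, so they all agree
    # on the winner without talking to each other. (Assumed scheme.)
    def rank(node):
        return hashlib.sha256(f"{node}#{service_ip}".encode()).hexdigest()

    return min(healthy_nodes, key=rank)


def failover(service_ip, nodes, failed):
    """Return (old_leader, new_leader) after one node drops out."""
    old = elect_leader(service_ip, nodes)
    survivors = [n for n in nodes if n != failed]
    new = elect_leader(service_ip, survivors)
    return old, new


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
    leader = elect_leader("10.0.0.200", nodes)
    old, new = failover("10.0.0.200", nodes, failed=leader)
    # The new leader is the node that would now send the gratuitous ARP.
    print(f"leader was {old}, after failure it is {new}")
```

The point of the sketch is the shape of the behavior: exactly one node answers for a given IP, and losing that node moves the IP to exactly one other node.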

The reason this is all Layer 2 and not Layer 3 is also the reason for the biggest constraint: ARP doesn't cross subnet boundaries. A router won't forward an ARP request from one VLAN into another. So the Service IP and the announcing nodes have to sit on the same broadcast domain as whatever device resolves that IP: a client on the same subnet, or the router's interface on that subnet when the client lives elsewhere. The instant you want a Service IP that lives on a different VLAN than your nodes, plain L2 mode stops being enough, and you're into either BGP or careful per-interface advertisement.

The port-ownership trap

Here's the failure that eats the most debugging time, and it's the one tutorials never warn you about. The IP gets assigned. kubectl get svc shows the EXTERNAL-IP. You telnet the port and it connects. Everything says "working." But the actual application returns connection resets or timeouts, and you can't figure out why a connection that succeeds is also broken.

A successful TCP handshake to a MetalLB IP proves exactly one thing: something along the path accepted the connection. It says nothing about whether a healthy backend pod is answering requests behind it. And an assigned IP proves even less. The classic cause of an assigned IP that goes nowhere is externalTrafficPolicy.

apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # the trap
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434

With externalTrafficPolicy: Local, traffic is only delivered to pods on the same node that received it, and the source IP is preserved. The problem: the IP is announced from one node, and if your single Ollama pod is scheduled on a different node, the announcing node has no local endpoint. Traffic arrives, finds nothing local to forward to, and kube-proxy drops it. Nothing in kubectl get svc changes: the IP is still assigned and still answers ARP, while clients just see hangs and timeouts.

MetalLB is supposed to handle this by only announcing from nodes that have a ready endpoint, but the interaction with a single-replica workload pinned to one node is genuinely confusing the first time you meet it. The fix, when you don't need to preserve the client source IP, is externalTrafficPolicy: Cluster. That lets any node accept the traffic and forward it internally to wherever the pod actually runs, at the cost of an extra hop and a SNAT'd source address.
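
The two policies are easier to reason about as a small decision function. This is a simplified model in Python, not kube-proxy's implementation: deliver and announce_candidates are names made up for this sketch, and the real forwarding choice under Cluster is effectively random rather than "first node alphabetically".

```python
def deliver(policy, receiving_node, pod_nodes):
    """Return the node whose pod serves the request, or None if it is dropped.

    policy         -- "Local" or "Cluster" (externalTrafficPolicy)
    receiving_node -- the node that answered ARP and received the packet
    pod_nodes      -- nodes currently running a ready backend pod
    """
    if policy not in ("Local", "Cluster"):
        raise ValueError(f"unknown policy: {policy}")
    if not pod_nodes:
        return None  # no ready endpoints anywhere
    if receiving_node in pod_nodes:
        return receiving_node  # served locally under either policy
    if policy == "Local":
        return None  # no local endpoint: dropped, never forwarded
    # Cluster: forwarded to an endpoint on another node (extra hop, SNAT).
    return sorted(pod_nodes)[0]


def announce_candidates(policy, all_nodes, pod_nodes):
    """Nodes that can safely announce the IP under each policy."""
    if policy == "Local":
        return [n for n in all_nodes if n in pod_nodes]
    return list(all_nodes)


if __name__ == "__main__":
    # Hypothetical cluster: the single Ollama pod runs on node-b only.
    print(deliver("Local", "node-a", ["node-b"]))    # -> None
    print(deliver("Cluster", "node-a", ["node-b"]))  # -> node-b
    print(announce_candidates("Local", ["node-a", "node-b", "node-c"], ["node-b"]))  # -> ['node-b']
```

With one replica under Local, the set of safe announcers shrinks to a single node, which is why that combination leaves no room for error.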

The lesson generalizes past MetalLB: stop treating "the port is open" as "the service works." Verify the full path.

# 1. Did MetalLB assign an IP from the pool?
kubectl get svc -A | grep LoadBalancer

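# 2. Which node is announcing it? (look for the assigned IP)
#    Helm installs label the speaker pods app.kubernetes.io/component=speaker
#    (raw-manifest installs use component=speaker). --tail=-1 overrides the
#    10-line default that kubectl applies when a selector is used.
kubectl logs -n metallb-system -l app.kubernetes.io/component=speaker --tail=-1 | grep 10.0.0.200

# 3. From ANOTHER host on the same VLAN, does the IP answer ARP?
arping -c 3 -I eth0 10.0.0.200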

# 4. Does the backend actually respond, not just accept the connection?
curl -v http://10.0.0.200:11434/api/tags

That step 3 is the one people skip. arping confirms the IP is genuinely claimed at Layer 2 on your network, independent of anything Kubernetes thinks. If arping gets no reply but kubectl swears the IP is assigned, your speaker isn't announcing on the interface you expect, and you've found your problem before wasting an hour in application logs.
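
The gap between "the port is open" and "the service works" is small enough to script. Here's a rough sketch in Python using only the standard library. The diagnose helper and its three result strings are my own invention, and it assumes a plain-HTTP backend like the Ollama example.

```python
import http.client
import socket
import urllib.request


def tcp_open(host, port, timeout=2.0):
    """True if a TCP handshake completes. Says nothing about the application."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(url, timeout=2.0):
    """True only if the backend returns a 2xx response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers timeouts, resets, refused connections, and HTTP error statuses.
        return False


def diagnose(host, port, path="/"):
    """Separate 'nothing answers' from 'something answers but is broken'."""
    if not tcp_open(host, port):
        return "unreachable"
    if not http_ok(f"http://{host}:{port}{path}"):
        return "port open, service broken"
    return "healthy"


if __name__ == "__main__":
    # Same hypothetical address and path as the curl check above.
    print(diagnose("10.0.0.200", 11434, "/api/tags"))
```

Only the third result means the path works end to end. The middle one is the state that a bare telnet check reports as success.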

The hybrid VLAN reality

Most tutorials assume a flat network where every node has one NIC on one subnet. Real labs are segmented. You might want ingress traffic on one VLAN and management on another, with nodes carrying multiple interfaces.

MetalLB v0.15.x handles this through interface and node selectors on the L2Advertisement. You can scope a pool so it only ever announces on a specific NIC, which keeps an ingress IP from accidentally being advertised on your management network.

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: ingress-vlan
  namespace: metallb-system
spec:
  ipAddressPools:
    - ingress-pool
  interfaces:
    - eth1          # only announce on the ingress-facing NIC
  nodeSelectors:
    - matchLabels:
        node-role: ingress

This is where L2's same-subnet constraint bites hardest. Each pool's addresses still have to be reachable on the broadcast domain of the interface announcing them. You can segment which IPs go where, but you can't make ARP magically route across VLANs. If your design genuinely needs Service IPs that live on subnets your nodes don't touch, that's the signal to either bring those VLANs to the nodes as tagged interfaces or move to BGP and let the router do the routing it's built for.

MetalLB as the foundation, not the finish line

A MetalLB IP is rarely the whole story. It's the stable entry point everything else builds on. The typical flow looks like this:

External client → MetalLB IP → ingress controller Service → Ingress/IngressRoute → backend Service → pod.

MetalLB hands ingress-nginx or Traefik a single, stable LAN IP. Your ingress controller terminates TLS and routes by hostname to dozens of backends. You point a wildcard DNS record at that one MetalLB IP, and now *.lab.example.com resolves to your cluster. A stable address is also what makes automated certificates practical. cert-manager's HTTP-01 solver needs the ingress reachable at the hostname being validated, while the DNS-01 solver proves ownership through a TXT record and never connects to the ingress, which is why it suits LAN-only hostnames. Either way, the issued certificate is served from that one address. MetalLB provides the address; cert-manager provides the identity.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  ingressClassName: nginx
  tls:
    - hosts: [ollama.lab.example.com]
      secretName: ollama-tls
  rules:
    - host: ollama.lab.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port: { number: 11434 }

That Ingress doesn't reference a MetalLB IP directly, and that's the point. It is picked up by the ingress controller named in ingressClassName, and that controller's Service is what holds the MetalLB-assigned IP. The same pattern extends upward to the Gateway API: a Gateway's listener address comes from a LoadBalancer Service, so gatewayAddresses in tools layered on top ultimately resolve to an IP MetalLB allocated. Get the bottom layer right and everything above inherits a stable address.

If you want a Service to keep a fixed IP across restarts, annotate it (the deprecated spec.loadBalancerIP is the trap here):

metadata:
  annotations:
    metallb.io/loadBalancerIPs: 10.0.0.210
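
A pinned address only makes sense if it falls inside a pool MetalLB owns. As far as I know, a request for an address outside every pool simply isn't assigned, so it's worth checking before you commit the annotation. A small sketch in Python, where pool_contains is a hypothetical helper that accepts both the range and CIDR forms used in pool entries:

```python
import ipaddress


def pool_contains(pool_entries, ip):
    """True if ip falls inside any entry; entries are 'first-last' or CIDR."""
    addr = ipaddress.ip_address(ip)
    for entry in pool_entries:
        if "-" in entry:
            first, last = (ipaddress.ip_address(p.strip()) for p in entry.split("-"))
            if first <= addr <= last:
                return True
        elif addr in ipaddress.ip_network(entry, strict=False):
            return True
    return False


if __name__ == "__main__":
    pool = ["10.0.0.200-10.0.0.250"]  # the lan-pool range from earlier
    print(pool_contains(pool, "10.0.0.210"))  # -> True
    print(pool_contains(pool, "10.0.0.199"))  # -> False
```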

What I'd do differently

A few things I wish I'd internalized before the first install.

Carve the IP pool out of DHCP on day one. The single most common MetalLB incident is an address conflict because the pool overlapped the DHCP lease range, and your router handed the same IP to a laptop. Decide the block, exclude it in your DHCP server, and document it. This is boring and it prevents a class of intermittent failures that are miserable to diagnose.

Stay on externalTrafficPolicy: Cluster, which is the Kubernetes default, unless you have a specific reason to preserve source IPs. Source IP preservation matters for some applications, such as geo-based logic and rate limiting, but if you don't need it, Local only buys you the single-node-endpoint trap. Reach for Cluster first and switch to Local deliberately when a workload demands the real client IP.

Treat the arping check as a permanent part of your toolkit, not a one-time debug step. Separating "is the pod running" from "is the network path open" is the difference between an hour of log-reading and a thirty-second confirmation. MetalLB sits exactly at the seam between Kubernetes and your physical network, so when it misbehaves, you have to test both sides independently.

And keep the whole config in Git. The pool, the advertisement, and the per-Service annotations are small, declarative, and exactly the kind of thing that drifts when you edit it by hand at 11pm. Folding MetalLB into a GitOps repo alongside the rest of the cluster, the same way I treat Longhorn for bare-metal storage, means the networking layer is as reproducible as everything else. Bare-metal Kubernetes is mostly the work of rebuilding the conveniences a cloud provider gave you for free, one controller at a time. If you're standing up this kind of infrastructure and would rather not rediscover every trap yourself, that's the kind of work I do.

The surprise, looking back, was how little of the difficulty was MetalLB itself. The install is two Helm commands and two small custom resources. The hard part was everything around it: which subnet, which VLAN, which traffic policy, and the stubborn assumption that a connection being accepted means a service is working. Get those right and the <pending> that started all of this never comes back.
