1. Context, Scope, and Design Objectives
This document presents a rigorous architectural synthesis of an on‑premises Kubernetes networking design implemented atop a Hyper‑V virtualization substrate. The objective is to construct a networking architecture that compensates for the absence of a cloud provider while remaining conceptually and operationally aligned with cloud‑native paradigms.
The design explicitly targets the following goals:
- Operation in a fully on‑premises environment without reliance on a managed cloud provider
- Functional support for `Service` objects of type `LoadBalancer`
- Clear and enforceable separation of concerns among compute, networking, and ingress responsibilities
- Immediate operational viability coupled with architectural compatibility for future autoscaling at both the pod and node levels
- Conceptual portability, such that the underlying mental model remains valid upon eventual migration to a public cloud environment
The core infrastructural components considered herein are:
- FRR Virtual Machine: An Ubuntu‑based virtual appliance running FRRouting, responsible for fabric‑level routing and control‑plane signaling
- Kubernetes Worker Nodes: Linux virtual machines hosting workloads and participating in routing advertisement
- MetalLB: Deployed in BGP mode to provide load‑balancer semantics in the absence of a cloud‑native implementation
2. Rationale for FRR and MetalLB in On‑Premises Kubernetes
2.1 Absence of a Cloud Provider and Its Consequences
In contrast to managed Kubernetes offerings, an on‑premises Kubernetes deployment lacks the following capabilities by default:
- A managed Layer‑4/Layer‑7 load balancer
- A managed routing plane capable of advertising service reachability
- Automatic integration between Kubernetes service abstractions and the surrounding network fabric
Consequently, the Service abstraction of type LoadBalancer is inert unless explicitly backed by external infrastructure. The burden of implementing these capabilities therefore shifts to the platform architect.
2.2 MetalLB as a Functional Analog to Cloud Load Balancers
MetalLB is introduced to fill this infrastructural void by providing two essential capabilities:
- Allocation of externally reachable virtual IP addresses for Kubernetes services
- Advertisement of reachability for those IPs to the upstream network
MetalLB supports two operational modes:
- Layer‑2 (L2) Mode, based on ARP/NDP announcements
- Border Gateway Protocol (BGP) Mode, based on explicit route advertisement
This architecture deliberately adopts BGP mode, for reasons grounded in scalability, determinism, and alignment with Kubernetes’ distributed systems model.
3. Justification for BGP Mode over L2 Mode (Engineering Perspective)
3.1 Architectural Implications of L2 Mode
In L2 mode, MetalLB assigns ownership of a LoadBalancer IP to a single node. That node responds to ARP requests on behalf of the service, and failover is achieved indirectly via ARP cache invalidation.
From an architectural standpoint, this entails:
- Tight coupling between a network identity (the service IP) and a specific node
- Implicit, non‑contractual ownership semantics
- Dependence on shared mutable state in the form of ARP caches
Such properties undermine predictability, complicate failure analysis, and conflict with Kubernetes’ expectation that nodes are ephemeral and interchangeable.
3.2 Architectural Advantages of BGP Mode
Under BGP mode, service reachability is expressed through explicit route advertisements. Nodes hosting service endpoints announce routes, and the routing fabric (FRR) selects viable forwarding paths. When a node becomes unavailable, its routes are withdrawn in a deterministic and protocol‑defined manner.
This yields several architectural advantages:
- Explicit contracts governing reachability and ownership
- Well‑defined state machines for convergence and failure handling
- Decoupling of node lifecycle events from ingress semantics
The conceptual correspondence between Kubernetes primitives and BGP behavior is direct:
| Kubernetes Event | BGP Effect |
|---|---|
| Pod scheduled | Route advertised |
| Pod terminated | Route withdrawn |
| Node added | BGP peer established |
| Node removed | BGP session torn down |
This symmetry renders BGP mode a natural fit for Kubernetes’ distributed control model.
4. Network Roles and Separation of Responsibilities
4.1 FRR Virtual Machine as a Fabric Router
The FRR virtual machine functions as a stable fabric‑level routing component. Its responsibilities are intentionally constrained and explicitly defined.
It does not act as:
- An Internet gateway
- A NAT device
- A default gateway for Kubernetes nodes
Instead, FRR serves exclusively as a fabric router, providing deterministic Layer‑3 reachability for Kubernetes Service and Pod traffic via BGP.
Network Interface Partitioning
The FRR VM is provisioned with two network interfaces, each mapped to a distinct routing domain:
- `eth0` (External – 192.168.1.0/24)
  - Management access
  - Outbound Internet connectivity (package installation, updates)
  - Connectivity to the broader on‑premises LAN
- `eth1` (Internal – 10.10.0.0/24)
  - Kubernetes fabric
  - BGP peering with Kubernetes nodes
  - Transport for Pod and LoadBalancer traffic
Critically, only internal fabric routes are advertised into BGP, preserving strict separation between management traffic and cluster data paths.
Netplan Configuration (FRR VM)
The following netplan configuration illustrates the canonical interface setup on the FRR VM:
```yaml
# /etc/netplan/50-cloud-init.yaml
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses:
        - 192.168.1.13/24
      routes:
        - to: default
          via: 192.168.1.1
      nameservers:
        addresses:
          - 1.1.1.1
          - 8.8.8.8
    eth1:
      dhcp4: false
      addresses:
        - 10.10.0.10/24
```
This configuration enforces a single default route via the external interface, ensuring that internal fabric traffic is never misrouted toward the management plane.
FRR (FRRouting) Configuration Overview
Within FRR, the routing policy is deliberately minimalistic and explicit:
```
router bgp 65001
 bgp router-id 10.10.0.10
 no bgp default ipv4-unicast
 address-family ipv4 unicast
  redistribute connected route-map INTERNAL_ONLY
 exit-address-family
```
Supporting policy objects restrict route advertisement to the Kubernetes fabric only:
```
ip prefix-list INTERNAL_NET seq 10 permit 10.10.0.0/24

route-map INTERNAL_ONLY permit 10
 match ip address prefix-list INTERNAL_NET
```
This ensures that:
- External subnets (e.g., 192.168.1.0/24) are never advertised to Kubernetes nodes
- FRR never advertises a default route or external prefixes, so it cannot inadvertently become the nodes' Internet gateway
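Route-maps in FRR terminate in an implicit deny, so prefixes outside INTERNAL_NET are already rejected. If that fail-closed behavior should be visible in the configuration itself, an explicit terminal clause can be appended; this is an optional refinement, not part of the configuration shown above:

```
! Optional: make the implicit deny explicit
! (route-maps already end in an implicit deny)
route-map INTERNAL_ONLY deny 20
```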
4.2 Kubernetes Nodes as Disposable Compute Elements
Kubernetes nodes are treated strictly as ephemeral compute resources. They host workloads and participate in routing advertisements via MetalLB, but they do not retain long‑term ownership of ingress IPs.
Each node:
- Establishes an explicit BGP peering relationship with FRR
- Advertises LoadBalancer IPs only while hosting active service endpoints
This design reinforces the principle that nodes remain stateless with respect to ingress, enabling both horizontal pod autoscaling and future node‑level autoscaling without architectural refactoring.
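A minimal Service manifest illustrating this behavior is sketched below. The name, namespace, and ports are hypothetical; the relevant detail is `externalTrafficPolicy: Local`, which causes MetalLB to advertise the LoadBalancer IP only from nodes that currently host a ready endpoint, so a node's advertisement is withdrawn as soon as its last endpoint disappears.

```yaml
# Hypothetical Service; only externalTrafficPolicy is load-bearing here.
apiVersion: v1
kind: Service
metadata:
  name: demo-web                   # hypothetical name
  namespace: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local     # advertise only from nodes with local endpoints
  selector:
    app: demo-web
  ports:
    - name: http
      port: 80
      targetPort: 8080
```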
5. FRR Configuration: Conceptual Overview
5.1 Governing Principles
The FRR configuration adheres to three guiding principles:
- The routing fabric must be stable and long‑lived
- Compute nodes must remain disposable
- External and internal routing domains must be strictly segregated
5.2 Functional Responsibilities of FRR
- Operate a BGP process under a dedicated autonomous system (AS 65001)
- Accept peering relationships from Kubernetes nodes (AS 65002)
- Advertise only internal fabric connectivity
5.3 On Explicit Peer Configuration
The requirement to explicitly configure BGP neighbors when adding new nodes is intrinsic to BGP's design: peerings are deliberate, explicitly declared contracts rather than discovered relationships, so this characteristic reflects protocol intent rather than architectural deficiency.
Crucially, this constraint does not impede future autoscaling; it merely indicates that automation of peer management has not yet been introduced.
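For concreteness, a hedged sketch of the FRR side of this peering is shown below. The peer-group name is an assumption, and the node addresses follow those used in the architecture diagram later in this document; because the base configuration disables the default IPv4 unicast activation, neighbors must be activated explicitly in the address family.

```
router bgp 65001
 ! Common settings for all Kubernetes nodes (AS 65002)
 neighbor K8S_NODES peer-group
 neighbor K8S_NODES remote-as 65002
 ! One explicit statement per node; repeated whenever a node is added
 neighbor 10.10.0.11 peer-group K8S_NODES
 neighbor 10.10.0.12 peer-group K8S_NODES
 address-family ipv4 unicast
  neighbor K8S_NODES activate
 exit-address-family
```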
6. MetalLB Configuration: Conceptual Overview
In BGP mode, MetalLB:
- Allocates LoadBalancer IPs from a predefined pool
- Advertises those IPs to FRR via BGP
- Withdraws advertisements automatically upon failure or topology changes
MetalLB deliberately refrains from managing FRR configuration or dynamically creating BGP neighbors. This boundary enforces a clean separation between Kubernetes‑level intent and infrastructure‑level policy.
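As a concrete but hedged sketch of how this intent is expressed with MetalLB's CRD-based configuration, the manifests below declare an address pool, a peering toward FRR, and a BGP advertisement. The resource names and the pool range are assumptions chosen for illustration; the peer address and AS numbers follow the values used throughout this document.

```yaml
# Hypothetical MetalLB configuration using the metallb.io CRDs
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lb-pool                    # assumed name
  namespace: metallb-system
spec:
  addresses:
    - 10.10.0.200-10.10.0.250      # assumed range inside the fabric subnet
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: frr-fabric                 # assumed name
  namespace: metallb-system
spec:
  myASN: 65002                     # Kubernetes nodes
  peerASN: 65001                   # FRR
  peerAddress: 10.10.0.10          # FRR eth1
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: lb-advertisement           # assumed name
  namespace: metallb-system
spec:
  ipAddressPools:
    - lb-pool
```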
7. Autoscaling Readiness and Evolutionary Path
7.1 Pod‑Level Autoscaling
Horizontal Pod Autoscaling operates entirely within the Kubernetes control plane and is unaffected by the external routing architecture. The design fully supports HPA without modification.
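For completeness, a minimal HorizontalPodAutoscaler sketch is shown below; the target Deployment name, replica bounds, and CPU threshold are hypothetical. Nothing in it references the routing layer, which is precisely the point.

```yaml
# Hypothetical HPA; entirely independent of FRR / MetalLB
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-web
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```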
7.2 Node‑Level Scaling: Present and Future
At present:
- Nodes are provisioned manually
- BGP peers are added manually
- System behavior remains stable and predictable
In a future automated environment:
- Node provisioning becomes orchestrated
- BGP peering is automated or abstracted
Importantly, the network architecture itself remains unchanged; only the automation layer evolves.
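One possible direction for abstracting the peering side, offered as an assumption rather than a recommendation, is FRR's dynamic-neighbor support: instead of one neighbor statement per node, FRR accepts sessions from any address in the fabric subnet that presents the expected AS.

```
router bgp 65001
 neighbor K8S_NODES peer-group
 neighbor K8S_NODES remote-as 65002
 ! Accept BGP sessions from any fabric address without per-node statements
 bgp listen range 10.10.0.0/24 peer-group K8S_NODES
 address-family ipv4 unicast
  neighbor K8S_NODES activate
 exit-address-family
```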
8. Architectural Continuity Across Cloud Migration
| On‑Premises Component | Cloud Analog |
|---|---|
| FRR VM | Managed cloud router / VPC router |
| MetalLB | Managed cloud LoadBalancer |
| Internal fabric | VPC subnet |
| BGP semantics | Provider‑managed routing |
The architectural mental model therefore persists intact across deployment environments.
9. Architecture Diagram (Textual Representation)
```
        External LAN (192.168.1.0/24)
                    |
                [ eth0 ]
            +----------------+
            |     FRR VM     |
            |    AS 65001    |
            +----------------+
                [ eth1 ]
                    |
       Kubernetes Fabric (10.10.0.0/24)
      ---------------------------------
           |           |           |
       [ Node-1 ]  [ Node-2 ]  [ Node-N ]
       AS 65002    AS 65002    AS 65002
           |           |           |
         Pods        Pods        Pods
```
10. Sequence Diagram: LoadBalancer Traffic Flow
```plantuml
@startuml
actor Client
participant "FRR\nBGP Speaker" as FRR
participant "Kubernetes\nNode" as Node
participant "Pod\nApplication" as Pod

Client -> FRR : TCP SYN to LoadBalancer IP
FRR -> Node : Forward via BGP-selected next hop
Node -> Pod : Service routing (kube-proxy / dataplane)
Pod --> Node : Response payload
Node --> FRR : Return traffic
FRR --> Client : Forward response
@enduml
```
11. Concluding Observations
- BGP mode is selected on the basis of architectural rigor rather than convenience
- FRR serves as a stable routing substrate, not an application gateway
- Nodes remain ephemeral and autoscaling‑compatible
- LoadBalancer semantics closely mirror those of managed cloud environments
- The design is simultaneously operationally sound today and structurally extensible for the future
This architecture prioritizes clarity, determinism, and long‑term evolutionary capacity—attributes essential to robust Kubernetes infrastructure at scale.
PlantUML Diagram
```plantuml
@startuml
skinparam backgroundColor #FFFFFF
skinparam shadowing false
skinparam componentStyle rectangle
skinparam defaultFontName Monospace

title Kubernetes on-prem with MetalLB (BGP) and FRR Fabric Router

'========================
' External Network
'========================
package "External LAN\n192.168.1.0/24\n(Management / Internet)" {
  node "Client\n192.168.1.2" as client
  node "is-kube-01\nControl Plane\neth0: 192.168.1.11" as kube01_ext
  node "is-kube-02\nWorker Node\neth0: 192.168.1.12" as kube02_ext
  node "FRR-VM\neth0: 192.168.1.13\n(No BGP here)" as frr_ext
}

'========================
' Internal Fabric
'========================
package "Kubernetes Internal Fabric\n10.10.0.0/24" {
  node "FRR-VM\nFabric Router\neth1: 10.10.0.10\nAS 65001" as frr_int
  node "is-kube-01\neth1: 10.10.0.11\nkubelet --node-ip\nAS 65002" as kube01_int
  node "is-kube-02\neth1: 10.10.0.12\nkubelet --node-ip\nAS 65002" as kube02_int
  component "MetalLB Speaker\n(on each node)" as metallb
}

'========================
' Kubernetes Objects
'========================
package "Kubernetes Cluster" {
  component "kube-apiserver\n(6443)" as apiserver
  component "kube-proxy\niptables / IPVS" as kubeproxy
  component "Pods\n(Containers)" as pods
}

'========================
' Management / Admin Path
'========================
client --> kube01_ext : SSH / HTTPS / kubectl
client --> kube02_ext : (optional admin)
kube01_ext --> apiserver : control-plane
apiserver --> kube01_ext
apiserver --> kube02_ext

'========================
' Routing Control Plane (BGP)
'========================
kube01_int --> frr_int : BGP (TCP 179)\nAdvertise LB IPs
kube02_int --> frr_int : BGP (TCP 179)\nAdvertise LB IPs
metallb --> kube01_int : Speaker binds\nInternalIP
metallb --> kube02_int

note right of frr_int
  FRR role:
  - BGP fabric router
  - Learns LoadBalancer IPs
  - No NAT
  - No data forwarding
end note

'========================
' Data Plane (Service Traffic)
'========================
frr_int ..> kube01_int : Routing info only\n(NO packets)
frr_int ..> kube02_int : Routing info only\n(NO packets)
kube01_int --> kubeproxy
kube02_int --> kubeproxy
kubeproxy --> pods

note bottom of kubeproxy
  Actual data path:
  Client -> Node holding LB IP
  kube-proxy forwards to Pod
  FRR NOT in packet path
end note

'========================
' Separation of Concerns
'========================
note bottom
  eth0 (192.168.1.x):
  - Internet
  - Admin
  - OS routing / apt
  eth1 (10.10.x.x):
  - Kubernetes fabric
  - BGP
  - MetalLB
end note
@enduml
```
Author's Note
This document was prepared as a general guide and reference note only. The content was drafted with the assistance of an AI system and subsequently reviewed, refined, and curated by the author.

