Amazon EKS Hybrid Nodes Gateway Deep Dive — VXLAN, Cilium VTEP, and Lease-Based Leader Election Redefining Hybrid Kubernetes Networking in 2026
On April 28, 2026, alongside the OpenAI–Bedrock partnership announcement at the What's Next with AWS 2026 keynote, AWS quietly delivered one of the most operationally meaningful container updates of the year: the general availability of the Amazon EKS Hybrid Nodes gateway. Since EKS Hybrid Nodes first shipped at re:Invent 2024, the single biggest operational tax on adopters has been a deceptively simple question: "How do we make on-premises pod CIDRs routable from the VPC?" That tax disappears with one Helm install. This post walks through the four-axis architecture of the new gateway, the AWS-maintained Cilium build with the CiliumVTEPConfig CRD, the VXLAN tunnel (VNI 2 / UDP 8472), the Kubernetes Lease-based leader election (3–5s failover), the automated VPC route table synchronization, IAM and CIDR design rules, and the parallel ECS announcements (EC2 Capacity Reservations integration and NLB Canary deployments) that landed in the same week — all from a hybrid-cluster operator's point of view.
1. Why the Gateway Is an Inflection Point — From 2024 GA to 2026 Gateway
EKS Hybrid Nodes went GA at re:Invent 2024 with a clear promise: "Manage cloud EC2 workers and on-prem bare metal under a single EKS control plane." The first year of adoption was rougher than the marketing implied. The biggest hurdle was networking. The AWS VPC CNI is incompatible with hybrid nodes, so operators had to deploy Cilium or Calico — and to make control-plane-to-webhook, EC2-pod-to-on-prem-pod, and ALB/NLB-to-on-prem-pod traffic work, they had to explicitly register on-prem pod CIDRs in VPC route tables, Transit Gateways, and Virtual Private Gateways.
That work created three persistent operational debts. First, every change to on-prem pod CIDRs required a coordinated change across VPC route tables, TGW, and VGW. Second, in some enterprises (finance, public sector) the routing policy simply forbids exposing pod CIDRs externally — which blocked EKS Hybrid Nodes adoption entirely. Third, BGP-level pod traffic exposure added a new monitoring and operational surface for IDC network teams.
The EKS Hybrid Nodes gateway released on April 28, 2026 pays down all three debts at once. The core decision is to stop trying to make pod CIDRs routable; instead, the traffic is encapsulated with VXLAN inside the VPC and carried to the hybrid nodes as opaque payloads. The VPC no longer needs to know that on-prem pod CIDRs exist. It only needs to know one thing: "send traffic destined for these pod CIDRs to the gateway EC2 ENI."
| Aspect | 2024 EKS Hybrid Nodes (BGP model) | 2026 Hybrid Nodes Gateway (VXLAN model) |
|---|---|---|
| On-prem pod CIDR exposure | Must register in VPC, TGW, VGW | Not required (VXLAN encapsulation) |
| VPC route table management | Manual or IaC-driven changes | Auto-synced by gateway |
| On-prem ↔ AWS routing | Requires BGP peering | UDP 8472 inbound/outbound is sufficient |
| Control plane → webhook | Routed via VGW/TGW | Encapsulated via gateway ENI |
| EC2 pod ↔ on-prem pod | Depends on BGP routing | Direct VXLAN tunnel |
| ALB/NLB → on-prem pod | Previously unsupported | Native, at no additional charge |
| Failover time | BGP reconvergence (tens of seconds) | Lease-based leader election: 3–5 seconds |
| Pricing | EKS Hybrid Nodes per-vCPU-hour | Gateway itself: no extra charge (only standard EC2/EKS fees) |
One-line summary: hybrid Kubernetes is finally free of BGP governance.
2. The Four-Axis Architecture — VPC, Gateway, VXLAN, On-Prem
The gateway can be reasoned about as four axes. Splitting responsibilities along these lines makes security design, observability, and troubleshooting fall into place at once.
2.1 Axis 1: VPC Route Table — "Send pod-CIDR traffic to the gateway ENI"
At install time the operator provides a list of VPC route table IDs. The gateway controller pod automatically inserts entries that point on-prem pod CIDRs at the active gateway pod's ENI. The gateway pod writes these entries itself, calling the EC2 APIs directly under its own IAM role — no IaC is involved.
# Auto-inserted route table entries (example)
Destination Target Status
10.200.0.0/16 eni-0abc1234... (active gateway pod) active
10.201.0.0/16 eni-0abc1234... active
# On failover the ENI target is updated to the new active ENI.
# On clean shutdown the routes are removed automatically.
Thanks to those routes, EC2 pods, ALBs/NLBs, and the EKS control plane ENIs (used for webhook calls) inside the VPC all have a deterministic path to the gateway whenever they target an on-prem pod IP.
2.2 Axis 2: Gateway EC2 — Active/Standby with Lease-Based Leader Election
The gateway is a Deployment of two pods. Both land on EC2 nodes labeled for gateway use (a dedicated node pool, managed node group, or self-managed nodes). Leader election uses a Kubernetes Lease object to decide which pod actively forwards traffic. Both Active and Standby create the hybrid_vxlan0 VXLAN interface at startup and run a node reconciler that watches CiliumNode CRs. Because both the VXLAN interface and the reconciler are pre-warmed on Standby, failover completes within 3–5 seconds when the Active pod dies.
Two operational implications follow. First, place the two gateway EC2 instances in different AZs. Second, choose a network-bandwidth-rich instance family (m6i/m7i large or above) — the gateway is the single path for all VPC-to-on-prem pod traffic.
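For reference, the Lease behind this election is a plain coordination.k8s.io object; the sketch below shows what the holder record might look like. The object name matches the verification step in section 4.4, while the duration and timestamps are illustrative assumptions, not documented values.
# Hypothetical Lease held by the active gateway pod (field values are illustrative)
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: eks-hybrid-gateway
  namespace: kube-system
spec:
  holderIdentity: eks-hybrid-gateway-79bc6f5c5d-abcde   # current leader pod
  leaseDurationSeconds: 4                               # illustrative; a short TTL is what enables the 3–5s failover
  renewTime: "2026-05-06T08:30:02.000000Z"
  leaseTransitions: 1                                   # increments on every failover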
2.3 Axis 3: VXLAN (VNI 2 / UDP 8472) — Cilium-Compatible by Default
VXLAN is implemented by the hybrid_vxlan0 interface inside the gateway EC2 instance. VNI 2 / UDP 8472 matches the Cilium default, so the gateway shares a data plane with the on-prem nodes' Cilium agents. When a new hybrid node registers, the gateway adds it as a remote VTEP, programming FDB entries, ARP entries, and routes on the VXLAN interface to complete the tunnel. Dynamic registration relies on the CiliumVTEPConfig CRD bundled with the AWS-maintained Cilium build — not present in upstream Cilium — which the gateway uses to register itself as the remote VTEP.
# CiliumVTEPConfig CR — auto-generated and managed by the gateway
apiVersion: cilium.io/v2alpha1
kind: CiliumVTEPConfig
metadata:
  name: eks-hybrid-gateway
  namespace: kube-system
spec:
  vtepEndpoints:
    - ip: 10.10.1.42            # active gateway ENI primary IP
      mac: "0a:1b:2c:3d:4e:5f"
      cidr: "10.0.0.0/16"       # VPC CIDR — encapsulate traffic for these pods
      vni: 2
      port: 8472                # UDP 8472, Cilium default
      mtu: 1450                 # 1500 minus 50 bytes of VXLAN encapsulation overhead
status:
  active: true
  lastReconciledAt: "2026-05-06T08:30:00Z"
2.4 Axis 4: On-Prem Nodes (Cilium VTEP Decoder)
Each on-prem node's Cilium agent reads the CiliumVTEPConfig and registers the gateway IP as a remote VTEP. When an encapsulated packet arrives, it strips the VXLAN header and routes inline to the destination pod. The reverse direction (on-prem pod → VPC) uses the same tunnel. Cilium unifies both directions into one data plane, so policies (NetworkPolicy / CiliumNetworkPolicy) apply consistently across the cluster.
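If you need to confirm the decode path on an on-prem node, standard Linux tooling is enough. A minimal sketch, assuming Cilium's default VXLAN device name (cilium_vxlan); adjust the device name and gateway IP to your environment:
# Show the VXLAN device Cilium created (device name may differ per configuration)
ip -d link show cilium_vxlan
# Forwarding-database entries: the gateway should appear as a remote VTEP destination
bridge fdb show dev cilium_vxlan
# Neighbor entries programmed for the tunnel (the gateway's inner MAC)
ip neigh show dev cilium_vxlan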
3. Prerequisites — IAM, Security Groups, CIDR Design
Three pre-flight tasks must be completed before deployment. Skipping any one of them either blocks the gateway from updating the VPC routes or yields a cluster where packets reach the on-prem side but never come back.
3.1 Gateway IAM Permissions — Scoped via IRSA
The gateway pod must update its own node's ENI-pointing routes, which requires EC2 permissions. Bind the following policy via IRSA (IAM Roles for Service Accounts).
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DescribeRouteTables",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeRouteTables",
        "ec2:DescribeNetworkInterfaces"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ManagePodCidrRoutes",
      "Effect": "Allow",
      "Action": [
        "ec2:CreateRoute",
        "ec2:ReplaceRoute",
        "ec2:DeleteRoute"
      ],
      "Resource": [
        "arn:aws:ec2:ap-northeast-2:123456789012:route-table/rtb-aaaa1111",
        "arn:aws:ec2:ap-northeast-2:123456789012:route-table/rtb-bbbb2222"
      ],
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/eks-hybrid-nodes-gateway": "true"
        }
      }
    }
  ]
}
Design note: never use a wildcard for Resource. Pin the route-table ARNs the gateway is allowed to manage, and add a tag-based Condition so that only route tables explicitly tagged eks-hybrid-nodes-gateway=true are eligible. This contains the blast radius if anything goes wrong.
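One way to wire that policy up via IRSA is eksctl's iamserviceaccount helper. A sketch under the assumption that the policy above was saved as a customer-managed policy and that the service-account name matches the chart defaults used in section 4.3; the cluster name, policy ARN, and role name are placeholders:
# Create the IRSA role and bind it to the gateway service account
eksctl create iamserviceaccount \
  --cluster prod-hybrid \
  --namespace kube-system \
  --name eks-hybrid-gateway \
  --attach-policy-arn arn:aws:iam::123456789012:policy/eks-hybrid-gateway-routes \
  --role-name eks-hybrid-gateway \
  --approve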
3.2 Security Groups and On-Prem Firewalls
VXLAN needs a single bidirectional UDP port (8472) open in both places.
| Where | Direction | Protocol/Port | Source/Destination |
|---|---|---|---|
| Gateway EC2 SG | Inbound | UDP 8472 | On-prem node IP CIDR |
| Gateway EC2 SG | Outbound | UDP 8472 | On-prem node IP CIDR |
| On-prem firewall | Inbound/Outbound | UDP 8472 | Gateway EC2 ENI IP range |
| EKS cluster SG | Inbound | TCP 443 / 10250 | On-prem node IP CIDR (kubelet ↔ API) |
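Opening the VXLAN port on the gateway security group is a single CLI call; the mirror-image rule belongs on the on-prem firewall. The group ID and CIDR below are placeholders:
# Allow VXLAN (UDP 8472) from the on-prem node range into the gateway SG
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol udp \
  --port 8472 \
  --cidr 192.168.10.0/24
# Outbound UDP 8472 is covered by the default allow-all egress rule;
# add an explicit egress rule only if your SG restricts outbound traffic.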
3.3 CIDR Design — RFC-1918 / RFC-6598, Non-Overlapping
On-prem node and pod CIDRs must come from one of these ranges:
- RFC-1918: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
- RFC-6598 (CGNAT): 100.64.0.0/10
And they must not overlap with any of:
- The VPC CIDR (e.g., 10.0.0.0/16)
- The Kubernetes service CIDR (e.g., 10.100.0.0/16)
- Each other (on-prem node CIDR ↔ on-prem pod CIDR)
- Any peered or TGW-routed CIDR
Practical recommendation: pick 100.64.0.0/16 or 100.65.0.0/16 from RFC-6598 for on-prem pod CIDRs. They almost never collide with internal RFC-1918 corporate ranges, and since the VPC never has to know about them, you get extra freedom.
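Before locking in a pod CIDR, it helps to dump every CIDR the VPC side already knows about and check it against the candidate range. A minimal sketch; the TGW route-table ID is a placeholder and the last step only applies if a Transit Gateway is attached:
# CIDRs attached to your VPCs
aws ec2 describe-vpcs --query "Vpcs[].CidrBlockAssociationSet[].CidrBlock"
# Every destination already present in the VPC route tables (peering, VGW, etc.)
aws ec2 describe-route-tables --query "RouteTables[].Routes[].DestinationCidrBlock"
# Active routes in a Transit Gateway route table, if one is attached
aws ec2 search-transit-gateway-routes \
  --transit-gateway-route-table-id tgw-rtb-0abc1234 \
  --filters "Name=state,Values=active" \
  --query "Routes[].DestinationCidrBlock"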
4. Deployment — One Helm Install, 30-Second Failover Test
The gateway ships as a Helm chart alongside the AWS-maintained Cilium build. Deployment is four steps.
4.1 Provision the Gateway Node Pool
# eksctl: dedicated gateway node group across two AZs
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-hybrid
  region: ap-northeast-2
  version: "1.32"
managedNodeGroups:
  - name: gateway-pool
    instanceType: m7i.large        # generous network bandwidth
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    availabilityZones: ["ap-northeast-2a", "ap-northeast-2c"]
    labels:
      role: hybrid-gateway
    taints:
      - key: hybrid-gateway
        value: "true"
        effect: NoSchedule
    privateNetworking: true
4.2 Install the AWS-Maintained Cilium
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install cilium eks/cilium-eks \
--namespace kube-system \
--version 1.16.7-eks.1 \
--set kubeProxyReplacement=true \
--set tunnelProtocol=vxlan \
--set ipv4NativeRoutingCIDR=100.64.0.0/16 \
--set hybridNodes.enabled=true
4.3 Install the Gateway Components + Bind IRSA
helm upgrade --install eks-hybrid-gateway eks/eks-hybrid-nodes-gateway \
--namespace kube-system \
--version 1.0.0 \
--set serviceAccount.create=true \
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::123456789012:role/eks-hybrid-gateway \
--set nodeSelector.role=hybrid-gateway \
--set tolerations[0].key=hybrid-gateway \
--set tolerations[0].operator=Equal \
--set tolerations[0].value=true \
--set tolerations[0].effect=NoSchedule \
--set vpcRouteTableIds="{rtb-aaaa1111,rtb-bbbb2222}" \
--set podAntiAffinity.topologyKey=topology.kubernetes.io/zone
4.4 Verify — 30-Second Failover Test
# 1) Identify the leader gateway pod
kubectl -n kube-system get lease eks-hybrid-gateway -o yaml | grep holderIdentity
# 2) Confirm the auto-inserted VPC route entries
aws ec2 describe-route-tables \
--route-table-ids rtb-aaaa1111 \
--query "RouteTables[].Routes[?DestinationCidrBlock=='100.64.0.0/16']"
# 3) Reach an on-prem pod from a VPC pod
kubectl run curl --rm -it --image=curlimages/curl -- \
curl -fsS http://100.64.0.50:8080/healthz
# 4) Failover test — kill the active pod
kubectl -n kube-system delete pod eks-hybrid-gateway-79bc6f5c5d-abcde
# Standby becomes leader within 3–5 seconds; ENI target on the VPC route flips automatically.
5. Traffic Matrix — Four Flows You Must Recognize
Once installed, the cluster carries traffic in four distinct patterns. Recognizing each pattern keeps policy, observability, and cost design coherent.
| Source → Destination | Path | Encapsulation | Use Case |
|---|---|---|---|
| EKS control plane → on-prem pod (Webhook) | EKS ENI → VPC route → gateway ENI → VXLAN → on-prem pod | VXLAN | Admission webhook, aggregated APIServer |
| VPC EC2 pod → on-prem pod | EC2 ENI → VPC route → gateway ENI → VXLAN → on-prem pod | VXLAN | Microservice-to-microservice calls |
| On-prem pod → VPC EC2 pod | On-prem Cilium → VXLAN → gateway ENI → EC2 pod | VXLAN (reverse) | IDC workload calling cache/services next to RDS |
| ALB/NLB → on-prem pod | ALB/NLB → gateway ENI → VXLAN → on-prem pod | VXLAN | External user traffic reaching IDC GPU/licensed workloads |
The fourth flow (ALB/NLB → on-prem pod) is the highest-impact change in 2026. Previously, sending external user traffic to an IDC workload meant placing a separate ALB/NLB inside the IDC or detouring via Direct Connect; now a single ALB can target both EC2 pods and IDC pods in the same target group. That unlocks two important workload classes — AI inference tied to GPU licenses anchored in the IDC, and compliance-bound data-processing that cannot leave the IDC — by letting them participate in the cloud-native ingress flow without extra infrastructure.
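In Kubernetes terms the fourth flow is just an ordinary Ingress with IP-mode targets. A minimal sketch, assuming the AWS Load Balancer Controller is installed and that, with the gateway in place, on-prem pod IPs backing the Service are registered in the target group the same way as EC2 pod IPs; the service name, namespace, and port are placeholders:
# Ingress sketch: ALB with pod-IP targets spanning EC2 and on-prem pods
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gpu-inference
  namespace: ml
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip   # register pod IPs directly in the target group
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /infer
            pathType: Prefix
            backend:
              service:
                name: gpu-inference
                port:
                  number: 8080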
6. Operations — Metrics, Failover Scenarios, and Five Common Pitfalls
6.1 Metrics and Logs
The gateway exposes Prometheus metrics. The standard pattern is to scrape them with the OpenTelemetry Collector and ship to CloudWatch or Grafana.
# Prometheus scrape example
- job_name: eks-hybrid-gateway
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: [kube-system]
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      action: keep
      regex: eks-hybrid-gateway
  metrics_path: /metrics
# Key metrics:
#   eks_hybrid_gateway_lease_holder{pod}               1=leader, 0=standby
#   eks_hybrid_gateway_vtep_count                       number of registered remote VTEPs
#   eks_hybrid_gateway_vxlan_packets_total{dir}         tx/rx packet counts
#   eks_hybrid_gateway_route_reconcile_errors_total     VPC route update failures
6.2 Four Failover Scenarios
| Failure | Recovery Time | Operator Action |
|---|---|---|
| Active gateway pod crash | 3–5s | None — Lease flips automatically |
| Active node OS crash | 5–10s (Lease TTL + route update) | None — ASG replaces the node |
| VPC route update failure | Alert only; previous routes remain | Verify IAM Condition and route-table ARN scoping |
| UDP 8472 blocked between sites | Total cutoff | Inspect firewalls / VPN / Direct Connect ACLs |
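The third row above ("alert only") is worth automating. A sketch of a PrometheusRule built on the metrics from section 6.1, assuming the Prometheus Operator is installed; the alert names and thresholds are illustrative:
# PrometheusRule sketch for the two highest-severity gateway conditions
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eks-hybrid-gateway-alerts
  namespace: kube-system
spec:
  groups:
    - name: hybrid-gateway
      rules:
        - alert: HybridGatewayRouteReconcileFailing
          expr: increase(eks_hybrid_gateway_route_reconcile_errors_total[5m]) > 0
          for: 5m
          labels:
            severity: critical
        - alert: HybridGatewayNoLeader
          expr: sum(eks_hybrid_gateway_lease_holder) == 0
          for: 1m
          labels:
            severity: critical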
6.3 Five Common Pitfalls
- Pod CIDR collision — even one bit of overlap with the VPC CIDR breaks the routes. Run aws ec2 describe-vpcs and review every peering and TGW entry.
- MTU 1450 missing — VXLAN encapsulation eats 50 bytes, so leaving the MTU at 1500 yields fragmentation errors. Set tunnelMTU=1450 in the Cilium chart (see the MTU check sketch after this list).
- Unidirectional UDP 8472 — both the security group and the on-prem firewall must allow traffic in both directions. Half-open kills the handshake.
- IAM Resource wildcard — letting the gateway touch arbitrary route tables is a foot-gun. Pin ARNs and use a tag Condition.
- Missing anti-affinity — if both gateway pods land on the same node or AZ, failover loses meaning. Enforce topologyKey=topology.kubernetes.io/zone.
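The MTU check referenced above takes one command. A sketch assuming iputils ping on the source host and a placeholder on-prem pod IP: a 1,422-byte ICMP payload plus 28 bytes of ICMP/IP headers exactly fills the 1,450-byte path, so any fragmentation error means an MTU is still wrong somewhere.
# 1422-byte payload + 8-byte ICMP header + 20-byte IP header = 1450 bytes on the wire
# -M do sets the Don't Fragment bit so an undersized hop fails loudly instead of fragmenting
ping -c 3 -M do -s 1422 100.64.0.50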
7. The Same Week: ECS Capacity Reservations and NLB Canary
The What's Next with AWS 2026 keynote also delivered two meaningful ECS updates that hybrid-cluster operators should review in the same quarter.
7.1 ECS Managed Instances + EC2 Capacity Reservations
ECS Managed Instance capacity providers gained a new option, capacityOptionType=reserved, allowing tasks to consume previously reserved EC2 capacity. Three strategies are available.
| Strategy | Meaning | Recommended Use |
|---|---|---|
| reservations-only | Launch only into reserved capacity | License/GPU/BYOL workloads |
| reservations-first | Prefer reservations, fall back to on-demand | Predictable baseline plus spike handling |
| reservations-excluded | Never use reservations | Spot-heavy cost-optimized workloads |
# Bind a Capacity Reservation to an ECS capacity provider
aws ecs create-capacity-provider \
--name reserved-baseline \
--auto-scaling-group-provider 'autoScalingGroupArn=arn:aws:autoscaling:...' \
--managed-instances-provider '{
"capacityOptionType": "reserved",
"capacityReservationGroupArn": "arn:aws:resource-groups:ap-northeast-2:123456789012:group/cr-baseline",
"reservationStrategy": "reservations-first"
}'
7.2 ECS Linear/Canary Deployments on NLB
ECS already supported linear/canary on ALB; NLB was the gap. With the new release, NLB-backed services can shift traffic in fractional steps, integrated with CloudWatch alarms for automatic rollback when P99 latency or 5xx error rates breach thresholds.
# Apply NLB canary deployment policy to an ECS service
aws ecs update-service \
--cluster prod \
--service api-grpc \
--deployment-configuration '{
"deploymentCircuitBreaker": {"enable": true, "rollback": true},
"strategy": "CANARY",
"stepWeights": [10, 25, 50, 100],
"stepDuration": 300
}'
Latency-critical workloads that require NLB — gRPC services, financial matching engines — finally get the same deployment freedom that ALB users have had for years.
8. A 4-Week Migration Checklist
Here is a 4-week plan a hybrid-cluster operator can use to migrate an existing EKS 1.32 + IDC GPU-node deployment to the new gateway:
| Week | Work | Done When |
|---|---|---|
| Week 1 | CIDR redesign, IAM role, security groups, Direct Connect ACL review | On-prem pod CIDR fixed in non-overlapping RFC-6598 range; IAM Resource ARN/tag scoped |
| Week 2 | Deploy gateway pool + AWS-maintained Cilium + gateway chart in Dev | Lease holder, VTEP count, 3–5s failover verified; P99 RTT measured |
| Week 3 | Run all four traffic flows end-to-end in Staging | Bidirectional reachability across all flows; consistent NetworkPolicy enforcement |
| Week 4 | Prod cutover + retire BGP routing + apply ECS Capacity Reservations / NLB Canary | Manually-registered pod CIDRs removed from VPC route tables; cost & latency SLOs healthy |
8.1 Cost Model
The gateway itself is free. Total cost is driven by three things: (1) the EC2 cost of two gateway instances (or managed node group equivalents); (2) data-transfer fees through the gateway ENIs; (3) the existing EKS Hybrid Nodes per-vCPU-hour fee. As long as you reuse an existing Direct Connect / VPN circuit, the gateway cuts operational burden while adding only the cost of two EC2 instances.
8.2 Recommended SLOs
- VTEP registration latency: new hybrid node visible to the VTEP within 30 seconds (P95)
- VXLAN packet drop rate: below 0.001% per minute
- Gateway failover time: under 5 seconds (P99)
- VPC route update failures: zero (critical alert)
- ALB → on-prem pod RTT overhead: within +1–2 ms over Direct Connect baseline
9. Conclusion — Hybrid Kubernetes Has a New Default
As of May 2026, EKS Hybrid Nodes gateway settles two things. First, the default network model for hybrid Kubernetes is encapsulation, not BGP. Not exposing pod CIDRs externally shortens compliance and security review, and reduces IaC churn around route tables to zero. Second, ALB and NLB can now treat EC2 pods and IDC pods as members of the same target group — letting "external user → cloud → IDC GPU" flow through a single ingress for the first time. That is the thread that pulls license-bound and data-sovereignty-bound workloads back into a cloud-native posture without separate infrastructure.
For us at ManoIT, the next-quarter roadmap has three legs. First, walk Dev → Staging → Prod over four weeks and pin gateway metrics into the front row of our SRE dashboard. Second, shift equivalent ECS workloads to NLB Canary, wired to P99-latency-driven automatic rollback. Third, anchor GPU and licensed baselines on ECS Managed Instances + Capacity Reservations while keeping a Spot fallback strategy. When those three land, our container infrastructure becomes one platform with three faces — EKS, ECS, and hybrid — visible from a single operational surface.
This article was produced with assistance from AI (Claude) and reviewed for technical accuracy.
© 2026 ManoIT — www.manoit.co.kr
Originally published at ManoIT Tech Blog.