DEV Community

Aisalkyn Aidarova

The Foundation: AWS VPC CNI Deep Dive

Let’s remove the training wheels. We are going to break down exactly how these concepts operate mechanically under the hood, how bytes physically travel through an AWS environment, and what a Senior DevOps Engineer (6+ Years Experience) writes, debugs, and architects daily in production.

You cannot understand EKS Services without understanding how Pods get their IPs. Many Kubernetes clusters run an "overlay network" (for example Flannel, or Calico in VXLAN mode) that encapsulates Pod packets inside node-to-node packets. EKS does not do this by default. It uses the native AWS VPC CNI.

Under the Hood

Every Pod is a first-class citizen in your AWS VPC. It gets a real, routable IP address pulled directly from your AWS Subnet's CIDR block.

  • The aws-node DaemonSet: Runs on every worker node. It consists of two components: the CNI Plugin (which wires up network interfaces) and the IPAMD (IP Address Management Daemon).
  • Warm Pools: ipamd keeps a pool of Elastic Network Interfaces (ENIs) and secondary IPv4 addresses pre-attached to your EC2 worker nodes so that when a Pod schedules, it gets an IP instantly.
+--------------------------------------------------------------+
| Worker Node (EC2 Instance)                                   |
|  [Primary ENI (eth0)] -> Host Node IP (10.0.1.50)            |
|                                                              |
|  [Secondary ENI (eth1)]                                      |
|     |-- Secondary IP 1 -> Assigned to Pod A (10.0.1.61)      |
|     |-- Secondary IP 2 -> Assigned to Pod B (10.0.1.62)      |
+--------------------------------------------------------------+

Senior Architectural Engineering: The IP Exhaustion Problem

Every EC2 instance size has a hard limit on how many ENIs and secondary IPs it can host. For example, a t3.medium can attach 3 ENIs, and each ENI can hold 6 IPs.

$$\text{Max Pods} = (\text{ENIs} \times (\text{IPs per ENI} - 1)) + 2$$

A t3.medium maxes out at 17 Pods. If your subnet is small (e.g., a /24), a few large nodes will completely consume your subnet's IP addresses, preventing scaling.
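
The arithmetic above is easy to check in a few lines of Python. This is a minimal sketch: the function name `max_pods` is made up for illustration, and the only instance figures used are the t3.medium limits quoted above. The comment about what the `+ 2` stands for reflects the usual explanation (host-network Pods), stated here as an assumption.

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """Pod limit for a node that hands out individual secondary IPs.

    Each ENI spends one address on its own primary IP, hence the "- 1".
    The "+ 2" is commonly attributed to host-network Pods such as
    aws-node and kube-proxy, which do not consume a secondary IP.
    """
    return enis * (ips_per_eni - 1) + 2


# t3.medium: 3 ENIs, 6 IPv4 addresses per ENI (figures from the text above)
print(max_pods(enis=3, ips_per_eni=6))  # 17
```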

Senior Solutions implemented at 6+ years:

  1. Prefix Delegation: Instead of allocating individual secondary /32 IPs, ipamd allocates entire /28 blocks (16 IPs) to the ENI. This increases pod density per node dramatically (up to the K8s recommended 110 pods per node).
  2. Custom Networking: You configure the VPC CNI to assign Pod IPs from an entirely separate, non-routable secondary VPC CIDR block (e.g., 100.64.0.0/16 CGNAT space), saving your primary corporate subnet IPs for the actual EC2 nodes.
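
To get a feel for what these two options buy you, here is a rough sketch using Python's standard `ipaddress` module. The helper names are hypothetical, and the prefix formula is an assumption modeled on the per-slot formula above: every secondary IP slot is taken to hold a /28 instead of a single address, and the result is capped at the 110-Pod recommendation.

```python
import ipaddress

# A /28 prefix holds 16 addresses
PREFIX_SIZE = ipaddress.ip_network("10.0.0.0/28").num_addresses


def max_pods_with_prefixes(enis: int, ips_per_eni: int, cap: int = 110) -> int:
    """Pod limit when every secondary IP slot holds a /28 prefix.

    The cap is a parameter because the raw address count usually ends up
    far above what a node can sensibly run.
    """
    raw = enis * (ips_per_eni - 1) * PREFIX_SIZE + 2
    return min(raw, cap)


def usable_addresses(cidr: str) -> int:
    """Addresses left in an AWS subnet after the 5 that AWS reserves."""
    return ipaddress.ip_network(cidr).num_addresses - 5


print(max_pods_with_prefixes(3, 6, cap=10_000))  # 242 (uncapped, t3.medium)
print(max_pods_with_prefixes(3, 6))              # 110
print(usable_addresses("10.0.1.0/24"))           # 251
print(usable_addresses("100.64.0.0/16"))         # 65531
```

The last two lines show why Custom Networking helps: a /24 leaves 251 usable addresses for nodes and Pods combined, while a dedicated 100.64.0.0/16 range gives Pods tens of thousands of their own.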

2. ClusterIP & kube-proxy Core Mechanics

When you define a ClusterIP service, Kubernetes creates a stable virtual IP address. But this IP does not exist on any physical network card. It is a ghost IP.

The Linux Kernel Data Path (iptables vs IPVS)

Every node runs a daemon called kube-proxy. It watches the Kubernetes API server for new Services and EndpointSlices (the real IPs of the backend pods matching your service selector).

[Pod A] ---> Tries to talk to ClusterIP (10.100.0.15:80)
                 |
          (Linux Kernel intercept via Netfilter)
                 |
     [iptables / IPVS Rules Engine]
                 |
     (Changes Destination IP via DNAT)
                 |
                 v
         [Pod B Real IP (10.0.1.62:8080)]

  • iptables Mode (Default): kube-proxy writes rules into the Linux kernel's Netfilter stack that are evaluated sequentially, O(N) in the number of rules. When a packet leaves a pod targeting a ClusterIP, the kernel intercepts it, executes a DNAT (Destination Network Address Translation), and swaps the ClusterIP with a randomly selected healthy Pod IP.
  • The 6-Year Gotcha: At large scales (over 5,000 services), iptables causes heavy CPU overhead because the first packet of every new connection must traverse a long, sequential list of rules, and kube-proxy has to rewrite that list whenever Services or endpoints change.
  • Production Fix: Senior engineers switch kube-proxy to IPVS (IP Virtual Server) mode. IPVS is built on Netfilter but keeps Services in a hash table, so lookups are O(1) and stay flat regardless of how many thousands of microservices exist in the cluster.
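
The "randomly selected" part of iptables mode is worth a closer look, because it also falls out of a sequential rule list. In the rules kube-proxy generates, each backend gets a rule with a match probability, and the probabilities grow as you go down the list so that every backend ends up with an equal share. The sketch below reproduces that arithmetic with exact fractions; the function names are invented for illustration and nothing here talks to a real kernel.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list[Fraction]:
    """Match probability written on each of n sequential backend rules.

    Rule i is only reached if rules 0..i-1 did not match, so it needs
    probability 1/(n - i). The last rule always matches.
    """
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n: int) -> list[Fraction]:
    """Overall chance that a new connection lands on each backend."""
    shares = []
    reach = Fraction(1)  # chance that evaluation gets as far as this rule
    for p in rule_probabilities(n):
        shares.append(reach * p)
        reach *= 1 - p
    return shares


print(rule_probabilities(3))  # [Fraction(1, 3), Fraction(1, 2), Fraction(1, 1)]
print(effective_share(3))     # [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
```

Note that the kernel still walks the rules one by one to get there, which is exactly the O(N) behavior described above.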

3. NodePort: The Multi-Hop Bridge

A NodePort service opens the same port, allocated from the 30000-32767 range by default, on every worker node.

The Hidden Packet Flow

If an external client hits Node-1-IP:32145, the traffic path looks like this:

  1. Packet arrives at Node 1.
  2. Node 1's iptables rules match port 32145 and hand the packet to the same Service rules that back the corresponding ClusterIP.
  3. The rule randomly selects a backend pod. If that pod happens to live on Node 2, Node 1 performs an SNAT (Source NAT) and forwards the packet across the AWS network to Node 2.
  4. Node 2 delivers it to the Pod.

Senior Structural Problem: externalTrafficPolicy

Notice the extra network hop between Node 1 and Node 2. This increases latency and erases the client's real IP address (the pod sees Node 1's IP as the source).

Senior engineers modify the service manifest:

spec:
  type: NodePort
  externalTrafficPolicy: Local # <--- CRITICAL

  • Local policy: Forces the node that receives the traffic to only route it to pods living on that exact same node. If no local pods exist, the packet is dropped. This preserves the original Client IP and removes the inter-node network hop.
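
The difference between the two policies can be modeled in a few lines. This is a toy model of the behavior described above, not kube-proxy code: the `Pod` class, the function names, the node names, and the client address 203.0.113.9 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    node: str


def candidate_backends(pods, receiving_node, policy="Cluster"):
    """Pods that the receiving node may forward a NodePort packet to.

    Cluster: any Pod behind the Service, on any node.
    Local:   only Pods on the node that received the packet.
    """
    if policy == "Local":
        return [p for p in pods if p.node == receiving_node]
    return list(pods)


def source_ip_seen(client_ip, receiving_node_ip, policy="Cluster"):
    """Source address the Pod observes on the forwarded packet."""
    # Cluster policy SNATs to the receiving node; Local leaves the source alone
    return client_ip if policy == "Local" else receiving_node_ip


pods = [Pod("pod-a", "node-1"), Pod("pod-b", "node-2")]
print([p.name for p in candidate_backends(pods, "node-1", "Cluster")])  # ['pod-a', 'pod-b']
print([p.name for p in candidate_backends(pods, "node-1", "Local")])    # ['pod-a']
print(candidate_backends(pods, "node-3", "Local"))                      # [] (packet dropped)
print(source_ip_seen("203.0.113.9", "10.0.1.50", "Cluster"))            # 10.0.1.50
print(source_ip_seen("203.0.113.9", "10.0.1.50", "Local"))              # 203.0.113.9
```

The empty list for node-3 is the trade-off to keep in mind: with Local, whatever sits in front of the nodes has to send traffic only to nodes that actually host a Pod.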

4. Ingress & AWS Load Balancer Controller (Enterprise Tier)

An Ingress is a collection of Layer 7 (Application Layer) routing rules. In EKS, you deploy the AWS Load Balancer Controller, an open-source operator that sits in your cluster, watches for Ingress objects, and calls AWS APIs to create an Application Load Balancer.

Architectural Deep Dive: Target-Type Modes

A Senior Engineer carefully chooses between two design modes using annotations:

alb.ingress.kubernetes.io/target-type: instance

The ALB targets the EC2 worker nodes using a NodePort.

  • Path: Client $\rightarrow$ ALB $\rightarrow$ NodePort (EC2 Instance) $\rightarrow$ kube-proxy (iptables) $\rightarrow$ Pod IP.
  • Cons: Double hopping, higher latency, complex health checking.

alb.ingress.kubernetes.io/target-type: ip

The ALB bypasses the EC2 instances completely and targets the Pods directly. This is only possible because the AWS VPC CNI gives Pods real VPC IPs.

  • Path: Client $\rightarrow$ ALB $\rightarrow$ Pod IP directly.
  • Pros: One fewer network hop, no kube-proxy in the data path, health checks that hit the Pod itself, direct traffic pattern.
[Internet Client]
       |
       v
  [AWS ALB]
       |
       +-----------------------+ (Target Type: IP)
       |                       |
       v                       v
[Pod 1 (10.0.1.61)]     [Pod 2 (10.0.1.62)]
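
For completeness, here is roughly what an Ingress using this annotation looks like, built as a Python dict and printed as JSON (kubectl accepts JSON manifests as well as YAML). Treat it as a sketch under assumptions: the names `web` and `web-svc`, the port, the `internet-facing` scheme, and an IngressClass called `alb` are placeholders, not values from a real cluster.

```python
import json


def alb_ingress(name: str, service: str, port: int, target_type: str = "ip") -> dict:
    """Build a minimal Ingress manifest for the AWS Load Balancer Controller."""
    if target_type not in ("ip", "instance"):
        raise ValueError("target_type must be 'ip' or 'instance'")
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "annotations": {
                "alb.ingress.kubernetes.io/scheme": "internet-facing",
                "alb.ingress.kubernetes.io/target-type": target_type,
            },
        },
        "spec": {
            "ingressClassName": "alb",  # assumes an IngressClass named "alb" exists
            "rules": [{
                "http": {
                    "paths": [{
                        "path": "/",
                        "pathType": "Prefix",
                        "backend": {
                            "service": {"name": service, "port": {"number": port}}
                        },
                    }]
                }
            }],
        },
    }


manifest = alb_ingress("web", "web-svc", 80)
print(json.dumps(manifest, indent=2))
```

Switching `target_type` to `"instance"` is the only change needed to compare the two modes, although the backing Service then has to be of type NodePort.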

5. Egress Architecture & Security Boundaries

Managing outbound traffic is a massive part of auditing and compliance (PCI-DSS, SOC2).

The Infrastructure Layer

Pods live on nodes inside Private Subnets. When they call an external API (e.g., Salesforce, GitHub), the traffic passes from the Pod $\rightarrow$ ENI $\rightarrow$ Private Subnet Route Table $\rightarrow$ AWS NAT Gateway (living in a Public Subnet) $\rightarrow$ Internet.

  • The NAT Gateway maps the internal IP to a single public Elastic IP (EIP).

The Senior Level Layer-7 Security Problem

Standard Kubernetes NetworkPolicies operate at Layer 3/4 (IP and Port). They cannot inspect domain names. If a malicious dependency slips into your application code, it can easily exfiltrate data to a domain like malicious-attacker.com over standard port 443, bypassing standard network policies.

Senior Design Implementations:

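  • Deploy Cilium, which uses eBPF, to implement L7-aware network policies. The policy also has to allow DNS lookups through Cilium's DNS proxy, otherwise Cilium never learns which IPs the allowed domain resolves to:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: restrict-egress-to-stripe
spec:
  endpointSelector:
    matchLabels:
      app: payment-processor
  egress:
  # Allow DNS via kube-dns so Cilium can learn the IPs behind the FQDN
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
  - toFQDNs:
    - matchName: "api.stripe.com" # <--- Only allow out to this domain
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP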
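
If it helps to see the intent of that policy spelled out, the following toy model expresses the same allowlist in Python. It is only an illustration of the decision being made (hostname plus port), not how Cilium enforces it: Cilium works on the IPs it learns from DNS answers. The names `ALLOWED`, `normalize`, and `egress_allowed` are made up.

```python
# Allowlist equivalent to the policy above: one hostname, one port
ALLOWED = {("api.stripe.com", 443)}


def normalize(hostname: str) -> str:
    """Lower-case and drop a trailing root dot so comparisons are exact."""
    return hostname.rstrip(".").lower()


def egress_allowed(hostname: str, port: int, allowed=ALLOWED) -> bool:
    """True only for an exact (hostname, port) match; everything else is denied."""
    return (normalize(hostname), port) in allowed


print(egress_allowed("api.stripe.com", 443))          # True
print(egress_allowed("API.Stripe.com.", 443))         # True
print(egress_allowed("malicious-attacker.com", 443))  # False
print(egress_allowed("api.stripe.com", 80))           # False
```

A plain L3/L4 NetworkPolicy cannot make the third decision, because both destinations are just "some IP on port 443" from its point of view.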

6. Real-World Troubleshooting Playbook for a Senior Engineer

When an application times out inside EKS, a Senior Engineer does not guess; they trace the network stack systematically.

                  [Is DNS Resolving?]
                     /          \
                (No) /            \ (Yes)
                    v              v
     Check CoreDNS Logs       [Can Pod contact ClusterIP?]
     Verify NodeLocal Cache       /          \
                             (No) /            \ (Yes)
                                 v              v
                     Check kube-proxy rules    Check Ingress / ALB Targets
                     Verify EndpointSlices     Verify Security Groups

1. "My Ingress returns a 502 Bad Gateway"

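  • Senior Action: Check the AWS ALB Target Group status via the AWS console or CLI. If targets are unhealthy or missing, check the Kubernetes Pod Readiness Probes. When a container's readiness probe fails, the Pod leaves the Service's endpoints and the AWS Load Balancer Controller removes the Pod IP from the ALB Target Group. The 502 itself means the ALB reached a target that reset or closed the connection or sent a malformed response, which typically shows up while Pods are being replaced; a Target Group with no registered targets returns a 503 instead.
  • Security Group Check: Ensure the Security Groups attached to the Worker Nodes/Pods allow inbound traffic from the ALB's Security Group on the application port.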
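
On the CLI side, `aws elbv2 describe-target-health --target-group-arn <arn>` returns JSON that is easy to filter. The sketch below parses that output shape and lists the targets that are not healthy. The `SAMPLE` payload is made-up illustrative data (it reuses the Pod IPs from the diagrams), and `unhealthy_targets` is a hypothetical helper name.

```python
import json

# Illustrative output in the shape returned by describe-target-health
SAMPLE = """
{
  "TargetHealthDescriptions": [
    {"Target": {"Id": "10.0.1.61", "Port": 8080},
     "TargetHealth": {"State": "healthy"}},
    {"Target": {"Id": "10.0.1.62", "Port": 8080},
     "TargetHealth": {"State": "unhealthy",
                      "Reason": "Target.FailedHealthChecks",
                      "Description": "Health checks failed"}}
  ]
}
"""


def unhealthy_targets(cli_output: str) -> list[tuple[str, int, str]]:
    """Return (target id, port, reason) for every target that is not healthy."""
    data = json.loads(cli_output)
    result = []
    for entry in data.get("TargetHealthDescriptions", []):
        health = entry["TargetHealth"]
        if health["State"] != "healthy":
            target = entry["Target"]
            result.append((target["Id"], target["Port"], health.get("Reason", "unknown")))
    return result


print(unhealthy_targets(SAMPLE))
# [('10.0.1.62', 8080, 'Target.FailedHealthChecks')]
```

With target-type ip, the target Id is the Pod IP, so you can match it straight against `kubectl get pod -o wide`.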

2. "Intermittent DNS Resolution Timeouts (5-second delays)"

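  • Senior Action: This is a well-known Linux kernel conntrack race, not an application bug. glibc sends the A and AAAA queries in parallel over the same UDP socket, and the kernel can drop one of them while creating the DNAT entry toward CoreDNS, so the resolver waits out its 5-second timeout before retrying. The Pod default of ndots:5 makes it worse by multiplying the number of lookups per hostname.
  • Resolution: Deploy NodeLocal DNSCache as a DaemonSet to handle DNS lookup requests locally on the node via a link-local address (169.254.20.10), which takes the DNAT and connection tracking step out of the Pod's lookup path.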
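
To see why ndots matters here, it helps to list the names a resolver will try for one external hostname. The sketch below models the search-list expansion under stated assumptions: the `lookup_order` helper is invented for illustration, the search list is what a Pod in the `default` namespace typically gets, and ndots is set to 5.

```python
def lookup_order(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Candidate names in the order a resolver would try them.

    Names with fewer dots than ndots go through the search list first;
    names with enough dots are tried as-is first. A trailing dot marks a
    fully qualified name and skips the search list entirely.
    """
    if name.endswith("."):
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in lookup_order("api.stripe.com", search):
    print(candidate)
# api.stripe.com.default.svc.cluster.local.
# api.stripe.com.svc.cluster.local.
# api.stripe.com.cluster.local.
# api.stripe.com.

print(lookup_order("api.stripe.com.", search))  # ['api.stripe.com.']
```

Four candidate names, each queried for A and AAAA, means up to eight UDP packets for a single external call. Writing the hostname with a trailing dot, or lowering ndots in the Pod's dnsConfig, cuts that down.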

3. "The Pod can't connect to an AWS RDS Database outside the cluster"

  • Senior Action:
  • Run kubectl get pod -o wide to determine the Pod's actual IP.
  • Check the AWS Security Group assigned to the RDS instance. Ensure it allows ingress from the Pod's specific IP block (or the Security Group assigned directly to the Pod if using Security Groups for Pods via Branch ENIs).
  • Verify that the routing tables in the EKS node's subnets point correctly to the VPC subnets hosting the database.
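
The Security Group step usually comes down to one question: does the Pod's IP fall inside a CIDR the RDS Security Group allows? Here is a small sketch with Python's standard `ipaddress` module. The helper name and the CIDR values are hypothetical; the 100.64.x address stands in for a Pod running with Custom Networking.

```python
import ipaddress


def allowed_by_rules(pod_ip: str, ingress_cidrs: list[str]) -> bool:
    """True if pod_ip falls inside at least one allowed ingress CIDR."""
    ip = ipaddress.ip_address(pod_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in ingress_cidrs)


# Rule written for the node subnet only
rds_ingress = ["10.0.1.0/24"]

print(allowed_by_rules("10.0.1.61", rds_ingress))   # True
print(allowed_by_rules("100.64.3.7", rds_ingress))  # False
```

The second case is the classic miss after enabling Custom Networking: the rule still covers the node subnet, but Pods now come from a different range.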
