Deploying Envoy Gateway on AWS EKS: The Right Way

#kubernetes #aws #apigateway #gitops

Some context first: we were running on GKE Autopilot, where the Gateway API just works out of the box. Google manages the CRDs and the underlying load balancer controller for you: create a Gateway, and it gets an external IP without you ever thinking about CRD lifecycle.

Moving that same ingress layer to EKS meant none of that was ready to use anymore. The first real decision wasn't about Envoy Gateway's configuration at all. It was about how to install its CRDs without them colliding with the Gateway API CRDs, or with each other, during future migrations.

Installing the Gateway API CRDs

Start with the Gateway API CRDs:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.1/standard-install.yaml

Among other things, this installs the CRDs this setup relies on:

GatewayClass CRD
Gateway CRD
HTTPRoute CRD
ReferenceGrant CRD

Verify with:

kubectl get crd | grep gateway.networking.k8s.io
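
Grepping the list shows that the CRDs exist, but not that the API server has finished registering them. A minimal sketch of a stricter check, assuming kubectl points at the target cluster; the function name and the 60s timeout are illustrative choices, not something this setup prescribes:

```shell
# Hypothetical helper: block until each Gateway API CRD used in this post
# reports the Established condition, or fail once the timeout is hit.
wait_for_gateway_api_crds() {
  local crd
  for crd in gatewayclasses gateways httproutes referencegrants; do
    kubectl wait --for=condition=Established --timeout=60s \
      "crd/${crd}.gateway.networking.k8s.io" || return 1
  done
}
```

Calling wait_for_gateway_api_crds right after the kubectl apply above gives a non-zero exit code if any of the four CRDs never becomes usable, which is easier to act on in a bootstrap script than grep output.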

Installing Envoy Gateway's CRDs and Controller via ArgoCD

Next, install Envoy Gateway's own CRDs and the controller itself as two separate ArgoCD Applications, on two separate sync waves:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: envoy-gateway-crds
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  project: default
  source:
    repoURL: oci://docker.io/envoyproxy
    chart: gateway-crds-helm
    targetRevision: v1.8.0
    helm:
      values: |
        crds:
          gatewayAPI:
            enabled: false       # Gateway API CRDs managed separately
            channel: experimental
          envoyGateway:
            enabled: true        # Only Envoy-specific CRDs
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
      - CreateNamespace=false
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: envoy-gateway
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  project: default
  source:
    repoURL: oci://docker.io/envoyproxy
    chart: gateway-helm
    targetRevision: v1.8.0
    helm:
      skipCrds: true             # CRDs managed by envoy-gateway-crds app
  destination:
    server: https://kubernetes.default.svc
    namespace: envoy-gateway-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
      - CreateNamespace=true

If you're not using ArgoCD, the equivalent Helm commands are below (the --server-side flag assumes Helm 4, where server-side apply was added):

helm install envoy-gateway-crds oci://docker.io/envoyproxy/gateway-crds-helm \
  --version v1.8.0 \
  --namespace default \
  --server-side \
  --set crds.gatewayAPI.enabled=false \
  --set crds.gatewayAPI.channel=experimental \
  --set crds.envoyGateway.enabled=true

helm install envoy-gateway oci://docker.io/envoyproxy/gateway-helm \
  --version v1.8.0 \
  --namespace envoy-gateway-system \
  --create-namespace \
  --skip-crds \
  --server-side

A few things matter here:

gatewayAPI.enabled: false: the shared Gateway API CRDs (GatewayClass, Gateway, HTTPRoute, ReferenceGrant) aren't installed by this chart. They're installed once, separately, by their own Application, independent of any controller.
envoyGateway.enabled: true: this chart installs only Envoy Gateway's own CRDs, including EnvoyProxy, on sync-wave -1, before the controller exists.
skipCrds: true on the envoy-gateway chart (wave 0): the controller deployment goes in after its CRDs already exist, and never touches CRD lifecycle itself.
ServerSideApply=true on both: field-level ownership instead of whole-object ownership, so multiple Applications can touch overlapping CRDs without one overwriting the other.

Both Applications are templated as part of our cluster-bootstrap ApplicationSet, so every environment gets Envoy Gateway's CRDs-then-controller order automatically, with no manual sequencing per cluster for this part of the stack.
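
To confirm that both halves of the install landed, a rough post-sync check can look like the sketch below. It assumes the chart's default Deployment name (envoy-gateway) in the envoy-gateway-system namespace used above; the function name and timeouts are made up for illustration:

```shell
# Hypothetical post-sync check: the EnvoyProxy CRD must be registered and
# the controller Deployment must report Available.
check_envoy_gateway_install() {
  # CRD comes from the envoy-gateway-crds Application (sync wave -1).
  kubectl wait --for=condition=Established --timeout=60s \
    crd/envoyproxies.gateway.envoyproxy.io || return 1
  # Controller comes from the envoy-gateway Application (sync wave 0).
  kubectl wait --for=condition=Available --timeout=5m \
    -n envoy-gateway-system deployment/envoy-gateway
}
```

If the first wait fails, the CRD Application is the place to look. If only the second fails, the CRDs are fine and the problem is in the controller rollout.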

Architecture at a Glance

Internet
   │
   ▼
AWS NLB  (provisioned by AWS Load Balancer Controller)
   │
   ▼
Envoy Proxy Pods  (managed by Envoy Gateway, autoscaled by HPA)
   │
   ▼
Application Services  (via HTTPRoute rules)

The LoadBalancer Pending Trap

With the CRDs and controller in place, the next thing that breaks on a fresh EKS setup is the Service Envoy Gateway generates for its proxy deployment. By default, it's type LoadBalancer, and Kubernetes' in-tree cloud controller tries to provision a Classic Load Balancer for it.

On modern EKS clusters, that fails silently. No CLB gets created, no useful error appears in events, and the Service just sits at <pending> indefinitely.

The fix doesn't live on the Gateway object: Envoy Gateway generates its own Service internally, so there's nothing on Gateway.metadata to annotate. The fix has to go into the EnvoyProxy resource, whose CRD was installed separately in the -1 sync wave above, via spec.provider.kubernetes.envoyService.annotations:

envoyService:
  annotations:
    # Stops the in-tree CLB provisioner - AWS Load Balancer Controller
    # takes over and creates an NLB instead.
    service.beta.kubernetes.io/aws-load-balancer-type: "external"

    # Public-facing NLB. Use "internal" for private traffic only.
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"

    # Route NLB traffic directly to Pod IPs via VPC CNI -
    # bypasses kube-proxy and NodePort.
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip

    # NLB health check targets port 19002, path /healthz. In Envoy Gateway
    # this port is served by the shutdown-manager sidecar rather than Envoy's
    # admin interface, so confirm it reflects readiness on your version.
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: HTTP
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "19002"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/healthz"

The single annotation that actually breaks the deadlock is aws-load-balancer-type: "external". It tells the in-tree controller to back off and hands the Service to the AWS Load Balancer Controller, which then provisions a real NLB. The NLB's hostname then shows up in gateway.status.addresses. The rest of the block (scheme, target type, health check) is what makes that NLB actually production-ready rather than just "not pending."
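
A quick way to see whether the Service is still stuck is to look at it and at the Gateway's status together. This is a sketch under a few assumptions: the generated Service lives in envoy-gateway-system (the controller namespace used in this post), Envoy Gateway labels it with the owning Gateway's name and namespace, and the helper name is hypothetical:

```shell
# Hypothetical helper: show the Service generated for a Gateway, then the
# address published on the Gateway itself.
show_gateway_lb() {
  local gw_name="$1" gw_ns="$2"
  # EXTERNAL-IP stays <pending> until a controller provisions the LB.
  kubectl get svc -n envoy-gateway-system \
    -l "gateway.envoyproxy.io/owning-gateway-name=${gw_name},gateway.envoyproxy.io/owning-gateway-namespace=${gw_ns}"
  # Prints nothing until the NLB hostname has been written to status.
  kubectl get gateway "${gw_name}" -n "${gw_ns}" \
    -o jsonpath='{.status.addresses[0].value}'
}
```

With the Gateway from this post, the call would be show_gateway_lb external-gateway gateway. An empty second line alongside a <pending> Service usually means the annotations never reached the Service.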

Putting It Together: GatewayClass, Gateway, and EnvoyProxy

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eg
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
  namespace: gateway
spec:
  gatewayClassName: eg
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: external-proxy-config
  listeners:
    - name: http
      protocol: HTTP
      port: 80

And the EnvoyProxy resource that ties resources, autoscaling, and the LB fix together in one place:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: external-proxy-config
  namespace: gateway
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyDeployment:
        patch:
          type: StrategicMerge
          value:
            spec:
              template:
                spec:
                  containers:
                    - name: shutdown-manager
                      lifecycle:
                        preStop:
                          exec:
                            command: ["/bin/sh", "-c", "sleep 120"]
        container:
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 1Gi

      envoyHpa:
        minReplicas: 1
        maxReplicas: 5
        metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 60

      envoyService:
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-type: "external"
          service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: HTTP
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "19002"
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/healthz"

Applying the Gateway triggers the chain reaction: Envoy Gateway creates the proxy deployment, the LoadBalancer Service with the annotations above, and an HPA — then AWS LBC provisions the NLB. HTTPRoute objects attach to the Gateway afterward and define per-service routing, owned by app teams.
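
For completeness, here is roughly what one of those HTTPRoute objects could look like. Everything specific in it is a placeholder: the route name, the echo.example.com hostname, and the echo Service on port 8080 are not from our setup. Only the parentRefs values match the Gateway defined above:

```shell
# Write a hypothetical HTTPRoute manifest to disk. Route name, hostname,
# and backend Service are placeholders, not part of the original setup.
cat > echo-route.yaml <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: echo
  namespace: gateway
spec:
  parentRefs:
    - name: external-gateway   # the Gateway defined above
      namespace: gateway
      sectionName: http        # the port-80 listener
  hostnames:
    - echo.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: echo           # Service in the same namespace as the route
          port: 8080
EOF
```

Apply it with kubectl apply -f echo-route.yaml. The route sits in the gateway namespace on purpose: a listener without allowedRoutes only accepts routes from the Gateway's own namespace, so routes owned by app teams in other namespaces would need allowedRoutes set on the listener first.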

Autoscaling Explained

The envoyHpa block creates a standard Kubernetes HPA against the proxy deployment. minReplicas: 1 keeps cost down during idle periods, at the cost of zero redundancy for ~15-30s if that pod dies. averageUtilization: 60 (150m of the 250m request) triggers scale-out early enough that new pods are healthy before latency degrades. For zero-downtime guarantees, minReplicas: 2 or 3 is the move.
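
The numbers above can be checked with a few lines of shell arithmetic. The replica formula is the standard one from the Kubernetes HPA documentation, ceil(currentReplicas * currentUtilization / targetUtilization). The function names and sample loads are made up for illustration, and the sketch ignores the HPA's tolerance band and the maxReplicas cap:

```shell
# CPU usage (in millicores) at which the utilization target is reached.
threshold_millicores() {  # args: request_millicores target_percent
  echo $(( $1 * $2 / 100 ))
}

# desiredReplicas = ceil(currentReplicas * currentPercent / targetPercent),
# written as integer ceiling division.
desired_replicas() {  # args: current_replicas current_percent target_percent
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

threshold_millicores 250 60   # prints 150
desired_replicas 1 90 60      # prints 2
desired_replicas 2 150 60     # prints 5
```

So a single pod averaging 90% of its 250m request already asks for a second replica, and two pods at 150% each would ask for five, which is exactly the maxReplicas ceiling configured above.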

Final Thoughts

Running Envoy Gateway on Amazon EKS isn't just about deploying another ingress controller — it's about understanding where the responsibilities are split.

Unlike managed Kubernetes offerings where the Gateway API experience is largely invisible, EKS gives you the flexibility to control every layer. That also means you own the lifecycle of the Gateway API CRDs, the Envoy Gateway CRDs, the controller installation, and the integration with the AWS Load Balancer Controller.

Separating CRDs from the controller, using ArgoCD sync waves to guarantee deployment order, and configuring the EnvoyProxy resource as the single place for infrastructure concerns makes the setup predictable and GitOps-friendly. It also avoids one of the most common migration issues: LoadBalancer Services remaining in a perpetual Pending state because the wrong controller is trying to provision them.