Kubernetes adoption of the Gateway API is settled. The project has made its call: the core resources are GA, the role-based model is established, and the ecosystem is moving. That decision is not the hard part.
What isn't settled, and what most guides skip entirely, is the controller decision that sits underneath it. Gateway API defines the routing model. It does not define what runs your traffic, how that component behaves under load, or what happens when it restarts in a cluster with five hundred routes and an incident already in progress. That's the controller decision. And it's where the architectural risk actually lives.
This post covers what the controller decision actually hinges on: failure modes, Day-2 behavior, and the operational tradeoffs that don't appear in comparison matrices.
Gateway API defines the model. Your controller choice determines the blast radius.
# Gateway API Kubernetes: Why the Controller Decision Matters
Gateway API graduated to GA with its v1.0 release in late 2023. The role-based model — GatewayClass, Gateway, HTTPRoute — separates infrastructure concerns from application routing in a way the original Ingress API was never designed to do. For platform teams managing multi-tenant clusters, this separation is architecturally significant: app teams manage their HTTPRoutes, platform teams own the Gateway and GatewayClass, and the permission model is explicit rather than annotation-based.
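That role separation can be sketched in two manifests. Names, namespaces, and the label convention here are illustrative, not prescriptive, and the Gateway assumes a GatewayClass already exists:

```yaml
# Platform team owns the Gateway; app teams own HTTPRoutes in their
# own namespaces. The listener only admits routes from namespaces the
# platform team has labeled.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: infra
spec:
  gatewayClassName: example-class      # must match an existing GatewayClass
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            gateway-access: "true"     # illustrative opt-in label
---
# App team resource, in their namespace, attaching to the shared Gateway.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
  namespace: team-a
spec:
  parentRefs:
  - name: shared-gateway
    namespace: infra
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /app
    backendRefs:
    - name: app-svc
      port: 8080
```

The permission boundary is visible in the manifests themselves: the app team can change routing rules but cannot touch listeners, TLS, or which namespaces are admitted.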
The migration from Ingress to Gateway API is well-documented at the spec level. What's less documented is the operational delta between controllers that implement it. Two clusters running Gateway API with different controllers can behave completely differently under the same failure condition. The API is standardized. The runtime behavior is not.
## The Fork That Matters: Ingress API vs Gateway API
Before the controller decision comes the API model decision, because the two are not interchangeable and your controller selection is downstream of it.
The Ingress API (networking.k8s.io/v1) is stable, universally supported, and battle-tested. It handles HTTP/HTTPS routing with host and path matching. It also handles almost nothing else without controller-specific annotations — which is where the operational debt starts accumulating in year two and compounds quietly through year five.
The Gateway API is the successor; its core resources reached GA with the v1.0 release. Typed resources, explicit cross-namespace permission grants via ReferenceGrant, expressive routing rules that live in version-controlled manifests rather than annotation strings. For new clusters, it is the correct default. For existing clusters with years of Ingress annotations in production, migration has a cost that needs to be planned rather than assumed away.
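The cross-namespace grant is worth seeing concretely, because it is the piece the Ingress API never had. A minimal ReferenceGrant sketch (ReferenceGrant is still `v1beta1` at the time of writing; namespaces are illustrative):

```yaml
# Lives in the namespace that OWNS the backend Service (team-b) and
# explicitly permits HTTPRoutes from team-a to reference Services here.
# Without this, a cross-namespace backendRef is rejected.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-team-a-routes
  namespace: team-b
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: team-a
  to:
  - group: ""        # core API group
    kind: Service
```

The grant is held by the target namespace, so the team being referenced is the one that consents — the inverse of annotation-based models, where the referencing side declares everything.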
Pick the API model first. The controller decision follows from it — not the other way around.
## Where Kubernetes Ingress Controllers Actually Fail
The ingress-nginx deprecation path has pushed a lot of teams into controller evaluation mode. Most of that evaluation happens at the feature level. Here's what happens at the operational level.
### Failure Mode 01 — Reload Storms Under Churn
NGINX-based controllers reload the worker process on every configuration change. In stable clusters this is invisible. In clusters with aggressive autoscaling or frequent deployments, reload frequency produces tail latency spikes, dropped WebSocket connections, and gRPC stream interruptions that don't correlate cleanly with any deployment event.
### Failure Mode 02 — Annotation Sprawl & Config Drift
The Ingress API handles basic routing. Everything else — rate limiting, authentication, upstream keepalive, CORS, proxy buffer tuning — lives in controller-specific annotations. In year one this is manageable. By year three, annotation blocks are copied without being understood, controller upgrades become change management exercises, and no one owns the full picture.
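What that sprawl looks like in practice: a representative ingress-nginx annotation block (values illustrative). Every line is controller-specific, and none of it survives a controller swap:

```yaml
metadata:
  annotations:
    # Rate limiting, auth, CORS, and proxy tuning all live here,
    # invisible to the Ingress API's own schema.
    nginx.ingress.kubernetes.io/limit-rps: "20"
    nginx.ingress.kubernetes.io/auth-url: "https://auth.internal/validate"
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
```

Nothing validates these strings against a schema at admission time by default, which is exactly how a copied-but-not-understood block survives three years of upgrades.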
### Failure Mode 03 — TLS & cert-manager Edge Cases
cert-manager is nearly universal in production Kubernetes. Its interaction with ingress controllers is a reliable source of subtle failures — certificate renewal triggers a resource update, the controller reloads, and a short window of stale certificate serving opens. Normally sub-second. Under ACME rate limiting or slow reload paths, the window extends and you get TLS handshake failures with no clean correlated deployment event.
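One mitigation is to widen the renewal window so renewals happen far from both expiry and ACME rate-limit pressure. A minimal cert-manager Certificate sketch; names and the issuer are illustrative:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: web-tls
  namespace: ingress-system
spec:
  secretName: web-tls
  duration: 2160h      # 90-day certificate lifetime
  renewBefore: 720h    # renew 30 days before expiry, leaving a wide
                       # buffer if ACME throttling delays reissuance
  dnsNames:
  - www.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```

A generous `renewBefore` does not shrink the stale-cert reload window itself, but it ensures a delayed renewal is an inconvenience rather than an expired certificate.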
### Failure Mode 04 — Cold-Start Reconciliation Window
Ingress controllers are not stateless in practice. On restart they must reconcile all Ingress or HTTPRoute resources before they can serve traffic correctly. In clusters with hundreds of route objects, this window is non-trivial — and if readiness probes are keyed to process start rather than reconciliation completion, rolling updates and node evictions become incidents.
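A sketch of the mitigation, assuming an ingress-nginx-style controller that serves its health endpoint on port 10254 (verify the path and port for your controller and version; the timing values are starting points, not tuned recommendations):

```yaml
# Deployment fragment: hold rollout progress until the controller has
# had time to reconcile, instead of gating only on process start.
spec:
  minReadySeconds: 30          # pod must stay Ready this long before the
                               # rollout proceeds, absorbing reconciliation
  template:
    spec:
      containers:
      - name: controller
        readinessProbe:
          httpGet:
            path: /healthz     # ingress-nginx default health endpoint
            port: 10254
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3
```

`minReadySeconds` is a blunt instrument — the honest fix is a readiness signal that reflects reconciliation completion — but it is the cheapest way to stop a rolling update from outrunning the controller's cold start.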
None of these failure modes appear in controller documentation. All of them will surface in production. The Kubernetes Day-2 incident patterns follow a consistent shape: the configuration was correct, the failure mode was structural, and it only became visible under the specific load condition that triggers it.
## Reload-Based vs Dynamic Configuration: The Architectural Fork
The reload vs dynamic configuration distinction is the most operationally significant difference between controller architectures — more significant than any feature comparison.
NGINX-based controllers reload the worker process on configuration changes. The reload itself is fast, typically under 100ms, and at low frequency it is invisible. At 50–100 reloads per hour from a cluster with aggressive HPA configurations or high deployment velocity, the cumulative effect on tail latency and persistent connections is real. Monitor `nginx_ingress_controller_config_last_reload_successful` and reload frequency before this becomes a production problem.
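A hedged sketch of that alerting as Prometheus rules, using two metrics ingress-nginx exposes. The 50-per-hour threshold is an illustrative starting point, not a standard; derive yours from a baseline:

```yaml
groups:
- name: ingress-reloads
  rules:
  - alert: IngressReloadStorm
    # Counts how often the last-successful-reload timestamp moved
    # in the past hour, i.e. reload frequency.
    expr: changes(nginx_ingress_controller_config_last_reload_successful_timestamp_seconds[1h]) > 50
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "ingress-nginx reloading >50 times/hour; expect tail-latency spikes"
  - alert: IngressReloadFailing
    # Gauge is 1 on success, 0 when the last reload failed.
    expr: nginx_ingress_controller_config_last_reload_successful == 0
    for: 5m
    labels:
      severity: critical
```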
Envoy-based controllers — Contour, Envoy Gateway, and Istio's ingress gateway — use xDS dynamic configuration delivery. Route changes propagate without process restart. For clusters with high pod churn or KEDA-driven autoscaling, this is architecturally significant rather than a preference. The autoscaler choice and the ingress controller choice have a dependency that most teams don't map until they're debugging correlated latency spikes.
Resource requests and limits on ingress controller pods are not a secondary concern. An under-resourced controller pod that gets OOM-killed or throttled under burst load is a full ingress outage. Size the controller like it's critical infrastructure, because it is.
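A starting-point sketch for that sizing; the numbers are illustrative and should be derived from your own traffic and route count:

```yaml
# Container fragment for the controller pod. Requests guarantee
# scheduling headroom; the memory limit protects the node; CPU is
# deliberately left unlimited so the data plane is never throttled
# under burst load.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    memory: 1Gi
```

Leaving the CPU limit off is a deliberate choice for latency-sensitive data-plane pods: CFS throttling on the ingress path shows up directly as request tail latency.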
## Controller Decision: Operational Tradeoffs by Profile
| Controller | Config Model | Gateway API | Best Fit | Watch For |
|---|---|---|---|---|
| ingress-nginx (community) | Reload on change | Partial | Stable clusters, Ingress API incumbents | Reload storms under HPA churn |
| NGINX Inc. (nginx-ingress) | Hot reload (NGINX Plus) | Partial | Enterprise with NGINX support contracts | License cost, annotation parity gaps |
| Contour | Dynamic xDS | Native (GA) | New clusters, Gateway API-first | Smaller ecosystem, fewer extensions |
| Traefik | Dynamic | Beta | Dev/staging, operator-heavy envs | Gateway API maturity, CRD proliferation |
| AWS LB Controller | ALB/NLB native | Yes | EKS-only, AWS-native workloads | Hard AWS lock-in, ALB cost at scale |
| Istio Gateway | Dynamic xDS | Native | Existing service mesh deployments | Operational complexity, sidecar overhead |
The service mesh vs eBPF tradeoff determines whether your ingress and east-west traffic share a unified data plane — and that decision has operational weight that shows up during incident response, not during initial deployment.
## The Three Questions the Decision Actually Hinges On
What is your cluster's churn rate? Count your Ingress-triggering events per hour: HPA scale events, deployments, cert renewals, configuration changes. If that number is high and climbing, reload-based controllers carry real operational risk. The 502 and MTU debugging patterns that show up in ingress troubleshooting often trace back to reload timing under load rather than configuration errors.
Where does your annotation investment live? If you have years of Ingress annotations encoding routing logic across hundreds of resources, the Gateway API migration cost is real. Run that migration when you're doing a platform modernization anyway — not as a standalone project.
Who operates this at 2 AM? A controller that a three-person platform team can debug during an incident is better than a technically superior controller no one fully understands. The platform engineering model puts ingress in the platform team's operational domain — the controller needs to fit their observability stack, runbook model, and on-call capability.
## The Day-2 Checklist Nobody Ships With
Before a controller goes to production, answer these:
- [ ] What is the controller's behavior during a rolling update — and is there a zero-downtime upgrade path documented for your version?
- [ ] How does it handle TLS certificate rotation under sustained load? Is the stale-cert serving window measured?
- [ ] What metrics does it expose natively, and what requires custom instrumentation? Is reload frequency in your alerting stack?
- [ ] What is the reconciliation time from cold start with your current route object count? Has this been measured — not estimated?
- [ ] Is a PodDisruptionBudget configured, and does it account for the reconciliation window — not just process start?
- [ ] What breaks first if the controller pod is evicted under node memory pressure? Is that failure mode in your runbook?
- [ ] If you're running a service mesh — is the ingress controller in or out of the mesh data plane, and is that decision explicit?
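For the PodDisruptionBudget item, a minimal sketch; the namespace and label selector are illustrative and must match your controller's pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-controller
  namespace: ingress-system
spec:
  minAvailable: 1      # never let a drain take the last serving replica
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-controller
```

The PDB only counts pods Kubernetes considers Ready — which is exactly why the readiness signal needs to mean "reconciled", not "process started", or the budget protects pods that cannot yet serve correctly.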
The containerd Day-2 failure patterns and these ingress failure modes share a structural similarity: invisible during initial deployment, compounding under real production load, surfacing at the worst possible time.
## Architect's Verdict
Gateway API is the correct architectural direction for new Kubernetes clusters in 2026. That decision is settled. The controller decision underneath it is not — and it carries more operational risk than the API model choice does.
For new infrastructure: Gateway API with Contour is the defensible default. The API is GA, the xDS-based configuration model eliminates reload risk, and you avoid accumulating annotation debt from day one. On EKS, the AWS Load Balancer Controller is the pragmatic choice if you're already committed to the AWS networking model — with the understanding that you are accepting the lock-in that comes with it.
For existing clusters on ingress-nginx: don't migrate for migration's sake. The ingress-nginx deprecation path has four documented options — evaluate them against your actual cluster profile, not the general recommendation.
Either way: measure your reload rate before it becomes a problem. Configure readiness probes against reconciliation completion, not process start. Don't assume cert-manager and your controller share the same definition of "ready." These failure modes are predictable. The only variable is whether they surface in your testing environment or in production during an incident.
Part of the Kubernetes Ingress Architecture Series on Rack2Cloud. Originally published at rack2cloud.com.