TL;DR
Switching to hostNetwork:true for your Kubernetes pods may seem like an effective solution to save IP addresses, but it can lead to unexpected and complex issues, especially with DNS traffic. We encountered scaling problems when hitting the packets-per-second (PPS) limit on the Elastic Network Interface (ENI) of our AWS nodes, resulting in DNS-related failures. Here’s how we debugged the problem, what we learned, and why this seemingly simple solution can create more trouble than it resolves.
Our Challenge: Scaling Issues Due to IP Address Exhaustion
In our Kubernetes cluster, we ran into scaling issues when our subnet ran out of IP addresses. Unfortunately, expanding the subnet was not an option as it required creating a new cluster—a disruptive and time-consuming process.
To address the issue, we set hostNetwork:true for certain pods. With this setting, a pod runs in the node's network namespace and shares the node's IP address instead of getting its own. Since the pods in question were clients only and not serving traffic, we felt they didn't need unique IPs. Initially, everything seemed to work fine, but then we started noticing DNS-related errors in our logs.
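For illustration, here is a minimal sketch of the kind of change involved. The pod name and image are placeholders, not our actual workload, and dnsPolicy: ClusterFirstWithHostNet is what lets a hostNetwork pod keep resolving cluster services.

```bash
# Minimal sketch of a client-only pod forced onto the node's network
# namespace. Name and image are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-client               # hypothetical name
spec:
  hostNetwork: true                  # pod shares the node's IP and network namespace
  dnsPolicy: ClusterFirstWithHostNet # keep cluster-DNS resolution for hostNetwork pods
  containers:
  - name: client
    image: curlimages/curl:latest    # placeholder client image
    command: ["sleep", "3600"]
EOF
```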
Observing the Problem: DNS Rate-Limiting
After enabling hostNetwork:true, we began seeing DNS errors. With no increase in request volume and no other configuration changes, we hypothesized that these issues stemmed from DNS rate-limiting. We suspected three potential sources:
Node-Local DNS
CoreDNS
AWS DNS Resolver
We dove into the logs, DNS caches, and metrics to pinpoint the source of the problem. Despite adding a node-local DNS cache to reduce request volume, the errors persisted.
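As a rough illustration of the kind of checks we ran on a node, assuming the default NodeLocal DNSCache listen address 169.254.20.10 and its default Prometheus metrics port 9253 (adjust for your deployment):

```bash
# Sketch only; assumes NodeLocal DNSCache defaults (169.254.20.10:53 for
# queries, :9253 for metrics) and that dig is installed on the node.

# Is the node-local cache answering at all?
dig +short kubernetes.default.svc.cluster.local @169.254.20.10

# Request/response counters exposed by the cache (it is built on CoreDNS,
# so it exports the standard coredns_* metrics).
curl -s http://169.254.20.10:9253/metrics \
  | grep -E 'coredns_dns_(requests|responses)_total'
```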
The Root Cause: AWS ENI PPS Limits
Through deeper investigation, we discovered that the issue wasn't DNS rate-limiting per se but a hard packets-per-second (PPS) limit at the ENI level on AWS nodes.
Here’s how it works:
By default, the AWS VPC CNI plugin assigns pod IPs as secondary addresses spread across multiple ENIs on a node, so pod traffic is balanced across those interfaces.
When hostNetwork:true is enabled, pods bypass this mechanism: they run in the node's network namespace and use the primary ENI, the same network interface the node itself uses.
As a result, all DNS traffic from those pods goes out through the primary ENI, which is subject to a hard PPS limit; in particular, AWS caps traffic to the Amazon-provided DNS resolver at 1024 packets per second per network interface.
Although the overall volume of DNS requests didn’t change, concentrating all traffic on the main ENI caused us to hit the PPS limit, leading to dropped packets and DNS failures.
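On instances that use the ENA driver, these drops show up in the per-interface allowance counters. A sketch of how to check them on a node (the interface name varies, e.g. eth0 or ens5):

```bash
# Run on the node. Non-zero *_allowance_exceeded counters mean packets were
# dropped or queued because an instance or ENI allowance was exceeded.
ethtool -S eth0 | grep allowance_exceeded

# Counters of interest on the ENA driver:
#   pps_allowance_exceeded       - the instance's aggregate PPS allowance was hit
#   linklocal_allowance_exceeded - traffic to link-local services (the VPC DNS
#                                  resolver, instance metadata) was throttled
#   conntrack_allowance_exceeded - the connection-tracking allowance was hit
```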
What did we learn?
hostNetwork:true has significant side effects.
It bypasses the AWS VPC CNI's pod-to-ENI balancing mechanism, potentially overloading the primary ENI with traffic.
AWS ENI PPS limits are easy to overlook.
Traffic to the Amazon-provided DNS resolver is capped at 1024 packets per second per ENI, and each instance type also has overall PPS allowances. Traffic exceeding these limits is dropped, causing failures that can be hard to diagnose.
DNS traffic optimization helps, but it's not a panacea.
Adding node-local DNS caching reduced our DNS traffic, but the PPS issue persisted. The root cause was traffic distribution, not volume.
Proper network planning is crucial.
Subnet exhaustion led us to make quick decisions that introduced complex issues. Expanding subnets early or adopting secondary CIDRs could have avoided this (see the sketch below).
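For completeness, here is a rough sketch of the secondary-CIDR route on EKS with the AWS VPC CNI (custom networking). All IDs, the CIDR, and the availability zone are placeholders, and nodes must be recycled for the change to take effect.

```bash
# Rough sketch with placeholder IDs; check the AWS VPC CNI custom-networking
# docs for your CNI version before applying anything like this.

# 1. Attach a secondary CIDR to the VPC and carve pod subnets out of it.
aws ec2 associate-vpc-cidr-block --vpc-id vpc-0123456789abcdef0 --cidr-block 100.64.0.0/16
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
  --cidr-block 100.64.0.0/19 --availability-zone us-east-1a

# 2. Enable custom networking so pod ENIs come from per-AZ ENIConfigs.
kubectl -n kube-system set env daemonset/aws-node AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
kubectl -n kube-system set env daemonset/aws-node ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

# 3. One ENIConfig per AZ, pointing pod ENIs at the new subnet.
kubectl apply -f - <<'EOF'
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a                 # must match the node's zone label value
spec:
  subnet: subnet-0123456789abcdef0 # placeholder: subnet from the secondary CIDR
  securityGroups:
    - sg-0123456789abcdef0         # placeholder: security group for pod ENIs
EOF
```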
Conclusions and Recommendations
Avoid using hostNetwork:true as a quick fix for IP exhaustion.
It might solve the immediate problem but introduces new risks, especially related to network traffic concentration.
Understand AWS ENI limits.
Be aware of the PPS limits for your instance types and how your CNI plugin manages traffic distribution (see the sketch after this list).
Plan for subnet scalability.
Use appropriately sized subnets or consider enabling additional CIDRs early to prevent IP exhaustion.
Optimize DNS traffic early.
Node-local DNS caching is a good practice to reduce DNS traffic, but ensure it’s part of a broader strategy to manage network traffic effectively.
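As a starting point for that kind of planning, the ENI and IP capacity of an instance type, and the ENIs actually attached to a given node, can be inspected from the CLI. The instance type and instance ID below are placeholders.

```bash
# How many ENIs and IPv4 addresses per ENI does an instance type support?
aws ec2 describe-instance-types --instance-types m5.xlarge \
  --query 'InstanceTypes[].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]'

# Which ENIs (and secondary IPs) are attached to a given node right now?
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,Description,PrivateIpAddresses[].PrivateIpAddress]'
```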
By addressing these challenges holistically, you can avoid the pitfalls we faced and ensure your Kubernetes clusters scale reliably without unexpected networking issues.
Final Thoughts
Networking in Kubernetes is complex, and solutions like hostNetwork:true can introduce subtle but impactful issues. We hope our story helps you understand the trade-offs and make better decisions in your own environments.