During the DNS investigation, I initially focused on CoreDNS and NodeLocal DNS metrics.
The real breakthrough came when I started correlating DNS failures with AWS instance-level network limits.
The most useful signals came from network allowance metrics exposed by the EC2 ENA driver via ethtool.
AWS Network Allowance Metrics
The following metrics represent network limits enforced at the EC2 instance level.
ethtool_linklocal_allowance_exceeded
Packets dropped because traffic to link-local services exceeded the packets-per-second (PPS) limit.
This directly affects DNS, IMDS, and Amazon Time Sync. If this value is above zero, you can 1) increase the number of CoreDNS replicas, 2) implement NodeLocal DNSCache, or 3) check the ndots setting, as covered in the post "The Hidden DNS Misconfiguration That Was Killing Performance in Our EKS Cluster (and How We Fixed it)".
ethtool_conntrack_allowance_available
Remaining number of connections that can be tracked before reaching the instance's connection-tracking limit. Supported on Nitro-based instances only.
ethtool_conntrack_allowance_exceeded
Packets dropped because the connection-tracking limit was exceeded and new connections could not be established.
ethtool_bw_in_allowance_exceeded
Packets queued or dropped because inbound aggregate bandwidth exceeded the instance limit.
ethtool_bw_out_allowance_exceeded
Packets queued or dropped because outbound aggregate bandwidth exceeded the instance limit.
ethtool_pps_allowance_exceeded
Packets queued or dropped because the bidirectional packets-per-second (PPS) limit was exceeded.
All *_allowance_exceeded metrics should ideally remain zero.
Any sustained non-zero value indicates a networking bottleneck at the instance level.
For all metrics except the link-local one, you can resolve the bottleneck by moving to a larger instance size or a type with higher network bandwidth, or by reducing the load on the instance.
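One way to act on that rule of thumb is to alert on sustained non-zero drop rates. A sketch of a Prometheus alerting rule for the PPS counter (the group and alert names are placeholders; the other *_allowance_exceeded metrics follow the same shape):

```yaml
groups:
  - name: network-allowance
    rules:
      - alert: InstancePpsAllowanceExceeded
        # Fires only when drops persist for 10 minutes, filtering out
        # brief spikes that may be harmless.
        expr: rate(node_ethtool_pps_allowance_exceeded{job="node-exporter"}[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "EC2 PPS allowance exceeded on {{ $labels.instance }}"
```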
Capturing Network Metrics in EKS
These metrics are exposed by the EC2 ENA driver via ethtool, collected by node exporter, scraped by Prometheus, and visualized in Grafana.
On Amazon Linux EKS nodes, ethtool is installed by default.
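Before wiring anything into Prometheus, you can read the counters straight from the ENA driver on a node. A quick check (eth0 is an assumption; your interface may be named ens5 or similar):

```shell
# List the ENA allowance counters from the NIC statistics; if the
# interface is not an ENA device (or ethtool is unavailable), print a
# note instead of failing.
ethtool -S eth0 2>/dev/null | grep allowance || echo "no allowance stats found for eth0"
```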
To collect these metrics, the ethtool collector must be enabled in node exporter.
Enable ethtool Collector in node exporter
Add the following arguments to the node exporter container.
containers:
  - args:
      - --collector.ethtool
      - --collector.ethtool.device-include=(eth|em|eno|ens|enp)[0-9s]+
      - --collector.ethtool.metrics-include=.*
After applying this change, the metrics will become available in Prometheus and Grafana.
Building the Grafana Dashboard
All panels are time-series panels, broken down per node, which makes it easier to correlate network saturation with DNS errors or latency.
Available Connection Tracking Capacity
The metric exported by node exporter is:
node_ethtool_conntrack_allowance_available
It represents the current number of connections that can still be tracked on each node.
PromQL query:
node_ethtool_conntrack_allowance_available{job="node-exporter"}
Packets Dropped Due to Conntrack Exhaustion
The metric node_ethtool_conntrack_allowance_exceeded is a counter that increases over time.
To calculate packet drops per second, use the rate() function.
sum by (instance) (
rate(
node_ethtool_conntrack_allowance_exceeded{job="node-exporter"}[1m]
)
)
The resulting panel shows packets dropped per second for each node.
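As a sanity check on what rate() is doing here, the same per-second calculation can be sketched in plain Python over two counter samples (the numbers are made up for illustration; real rate() also corrects for counter resets):

```python
def per_second_rate(samples):
    """Average per-second increase between the first and last
    (timestamp_seconds, counter_value) samples -- roughly what
    Prometheus rate() computes over a range window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Two scrapes 60 seconds apart: the drop counter rose by 300 packets.
print(per_second_rate([(0, 1000), (60, 1300)]))  # 5.0 packets/s
```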
Other Network Allowance Exceeded Metrics
Add the following panels using the same counter-to-rate approach.
node_ethtool_bw_in_allowance_exceeded
node_ethtool_bw_out_allowance_exceeded
node_ethtool_pps_allowance_exceeded
node_ethtool_linklocal_allowance_exceeded
Each panel shows packets dropped per second per node.
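For example, the inbound-bandwidth panel can use the same query shape as the conntrack one; only the metric name changes:

```
sum by (instance) (
  rate(node_ethtool_bw_in_allowance_exceeded{job="node-exporter"}[1m])
)
```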
Full Grafana dashboard JSON:
Final Insight
All allowance exceeded metrics are tied to EC2 instance sizing, with one exception.
Link-local traffic has a fixed limit of 1024 packets per second, regardless of instance size.
This explains why DNS can fail even when CPU, memory, and pod-level metrics look healthy.
The bottleneck exists below Kubernetes, at the EC2 networking layer.
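To get a feel for how little headroom that fixed limit leaves, here is a back-of-the-envelope estimate. The fan-out figure is an illustrative assumption, not a measurement from this cluster:

```python
# The link-local allowance is fixed at 1024 packets per second per
# instance, shared by traffic to the VPC DNS resolver, IMDS, and
# Amazon Time Sync.
LINK_LOCAL_PPS_LIMIT = 1024

# Illustrative assumption: with ndots:5 and several search domains, one
# application lookup can fan out to ~10 wire queries (search-suffix
# expansion times A + AAAA). Only packets sent toward the link-local
# service are counted against the limit here.
queries_per_lookup = 10

max_lookups_per_sec = LINK_LOCAL_PPS_LIMIT / queries_per_lookup
print(max_lookups_per_sec)  # 102.4 lookups/s across the whole node
```

Roughly a hundred lookups per second for every pod on the node combined is easy to exceed, which is why the drops appear long before any CPU or memory pressure does.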
Takeaway
If you are debugging intermittent DNS failures on EKS, do not stop at CoreDNS metrics.
Always inspect instance-level network allowances.