Ahmed Shendy

DNS Failures in EKS? The Real Bottleneck Was AWS Network Limits

During the DNS investigation, I initially focused on CoreDNS and NodeLocal DNS metrics.

The real breakthrough came when I started correlating DNS failures with AWS instance-level network limits.

The most useful signals came from network allowance metrics exposed by the EC2 ENA driver via ethtool.


AWS Network Allowance Metrics

The following metrics represent network limits enforced at the EC2 instance level.

  • ethtool_linklocal_allowance_exceeded

    Packets dropped because traffic to link-local services exceeded the packets-per-second (PPS) limit.

    This directly affects DNS, IMDS, and Amazon Time Sync. If this value is above zero, you can 1) increase the number of CoreDNS replicas, 2) implement NodeLocal DNSCache, or 3) review the ndots setting as described in The Hidden DNS Misconfiguration That Was Killing Performance in Our EKS Cluster (and How We Fixed it); a pod dnsConfig sketch for the ndots option follows this metric list.

  • ethtool_conntrack_allowance_available

    Remaining number of connections that can be tracked before reaching the instance’s connection-tracking limit.

    Supported on Nitro-based instances only.

  • ethtool_conntrack_allowance_exceeded

    Packets dropped because the connection-tracking limit was exceeded and new connections could not be established.

  • ethtool_bw_in_allowance_exceeded

    Packets queued or dropped because inbound aggregate bandwidth exceeded the instance limit.

  • ethtool_bw_out_allowance_exceeded

    Packets queued or dropped because outbound aggregate bandwidth exceeded the instance limit.

  • ethtool_pps_allowance_exceeded

    Packets queued or dropped because the bidirectional packets-per-second (PPS) limit was exceeded.

All *_allowance_exceeded metrics should ideally remain zero.

Any sustained non-zero value indicates a networking bottleneck at the instance level.
For all metrics except the link-local one, you can address the bottleneck by moving to an instance size or type with higher network bandwidth, or by reducing the load on the instance.
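
As a minimal sketch of the ndots mitigation mentioned above, the option can be set per pod via dnsConfig. The pod name, image, and the ndots value of 2 are assumptions; pick a value that matches your domain layout.

apiVersion: v1
kind: Pod
metadata:
  name: app            # placeholder name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"     # fewer search-domain expansions per lookup, so fewer DNS queries
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
      command: ["sleep", "3600"]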


Capturing Network Metrics in EKS

These metrics are exposed by the EC2 ENA driver via ethtool, collected by node exporter, scraped by Prometheus, and visualized in Grafana.

On Amazon Linux EKS nodes, ethtool is installed by default.
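
Before wiring anything into Prometheus, you can check the counters directly on a node. A quick sketch, assuming the primary interface is eth0 (on some AMIs it is ens5):

# Run on the node (for example via SSM); the interface name is an assumption
ethtool -S eth0 | grep allowance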

To collect these metrics, the ethtool collector must be enabled in node exporter.


Enable ethtool Collector in node exporter

Add the following arguments to the node exporter container.

containers:
- args:
  - --collector.ethtool
  - --collector.ethtool.device-include=(eth|em|eno|ens|enp)[0-9s]+
  - --collector.ethtool.metrics-include=.*

After applying this change, the metrics will become available in Prometheus and Grafana.
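
If node exporter is deployed through the kube-prometheus-stack Helm chart, the same flags can usually be passed through the prometheus-node-exporter subchart; a sketch, assuming that chart layout:

# values.yaml (kube-prometheus-stack) – assumed chart structure
prometheus-node-exporter:
  extraArgs:
    - --collector.ethtool
    - --collector.ethtool.device-include=(eth|em|eno|ens|enp)[0-9s]+
    - --collector.ethtool.metrics-include=.*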

Building the Grafana Dashboard

All panels are time series panels, broken down per node, to help correlate network saturation with DNS errors or latency.

Available Connection Tracking Capacity

The metric exported by node exporter is:

node_ethtool_conntrack_allowance_available

It represents the current number of connections that can still be tracked on each node.
PromQL query:

node_ethtool_conntrack_allowance_available{job="node-exporter"}

(Panel screenshot: AWS instance-level available connections)
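
To get warned before a node runs out of trackable connections, a Prometheus alerting rule can watch this gauge. A sketch, where the 10000-connections threshold is an arbitrary assumption:

groups:
  - name: ec2-network-allowances
    rules:
      - alert: ConntrackAllowanceLow
        # Remaining trackable connections on the instance are getting low (threshold is an assumption)
        expr: node_ethtool_conntrack_allowance_available{job="node-exporter"} < 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack allowance low on {{ $labels.instance }}"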

Packets Dropped Due to Conntrack Exhaustion

The metric node_ethtool_conntrack_allowance_exceeded is a counter that increases over time.
To calculate packet drops per second, use the rate() function.

sum by (instance) (
  rate(
    node_ethtool_conntrack_allowance_exceeded{job="node-exporter"}[1m]
  )
)

The resulting panel looks like this:

(Panel screenshot: Packets Dropped Due to Conntrack Exhaustion)
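
A matching alert can fire on any sustained drop rate, since any non-zero value already indicates a problem. A sketch, added to the same rule group as above:

- alert: ConntrackAllowanceExceeded
  # Packets are being dropped because the connection-tracking limit was exceeded
  expr: |
    sum by (instance) (
      rate(node_ethtool_conntrack_allowance_exceeded{job="node-exporter"}[5m])
    ) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Conntrack limit exceeded, packets dropped on {{ $labels.instance }}"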


Other Network Allowance Exceeded Metrics

Add the following panels using the same counter-to-rate approach.

  • node_ethtool_bw_in_allowance_exceeded
  • node_ethtool_bw_out_allowance_exceeded
  • node_ethtool_pps_allowance_exceeded
  • node_ethtool_linklocal_allowance_exceeded

Each panel shows packets dropped per second per node.
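
For example, the link-local panel, which covers the DNS, IMDS, and Time Sync traffic discussed earlier, uses the same pattern:

sum by (instance) (
  rate(
    node_ethtool_linklocal_allowance_exceeded{job="node-exporter"}[1m]
  )
)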

Full Grafana dashboard JSON:

Network limits dashboard


Final Insight

All *_allowance_exceeded metrics are tied to EC2 instance sizing, with one exception.

Link-local traffic has a fixed limit of 1024 packets per second, regardless of instance size.

This explains why DNS can fail even when CPU, memory, and pod-level metrics look healthy.

The bottleneck exists below Kubernetes, at the EC2 networking layer.


Takeaway

If you are debugging intermittent DNS failures on EKS, do not stop at CoreDNS metrics.

Always inspect instance-level network allowances.
