Ahmed Shendy

DNS Failures in EKS? The Real Bottleneck Was AWS Network Limits

During the DNS investigation, I initially focused on CoreDNS and NodeLocal DNS metrics.

The real breakthrough came when I started correlating DNS failures with AWS instance-level network limits.

The most useful signals came from network allowance metrics exposed by the EC2 ENA driver via ethtool.


AWS Network Allowance Metrics

The following metrics represent network limits enforced at the EC2 instance level.

  • ethtool_linklocal_allowance_exceeded

    Packets dropped because traffic to link-local services exceeded the packets-per-second (PPS) limit.

    This directly affects DNS, IMDS, and Amazon Time Sync. If this value is above zero, you can 1) increase the number of CoreDNS replicas, 2) implement NodeLocal DNSCache, or 3) review the ndots setting as described in The Hidden DNS Misconfiguration That Was Killing Performance in Our EKS Cluster (and How We Fixed it); a pod dnsConfig sketch for the ndots option follows this metric list.

  • ethtool_conntrack_allowance_available

    Remaining number of connections that can be tracked before reaching the instance’s connection-tracking limit.

    Supported on Nitro-based instances only.

  • ethtool_conntrack_allowance_exceeded

    Packets dropped because the connection-tracking limit was exceeded and new connections could not be established.

  • ethtool_bw_in_allowance_exceeded

    Packets queued or dropped because inbound aggregate bandwidth exceeded the instance limit.

  • ethtool_bw_out_allowance_exceeded

    Packets queued or dropped because outbound aggregate bandwidth exceeded the instance limit.

  • ethtool_pps_allowance_exceeded

    Packets queued or dropped because the bidirectional packets-per-second (PPS) limit was exceeded.

All *_allowance_exceeded metrics should ideally remain zero.

Any sustained non-zero value indicates a networking bottleneck at the instance level.
For all metrics except the link-local one, you can address the bottleneck by moving to an instance size or type with higher network bandwidth, or by reducing the load on the instance.
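
As a minimal sketch of the ndots mitigation mentioned above, the option can be set per pod via dnsConfig. The pod name, image, and the ndots value of 2 are assumptions; pick a value that matches your domain layout.

apiVersion: v1
kind: Pod
metadata:
  name: app            # placeholder name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"     # fewer search-domain expansions per lookup, so fewer DNS queries
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
      command: ["sleep", "3600"]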


Capturing Network Metrics in EKS

These metrics are exposed by the EC2 ENA driver via ethtool, collected by node exporter, scraped by Prometheus, and visualized in Grafana.

On Amazon Linux EKS nodes, ethtool is installed by default.
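
Before wiring anything into Prometheus, you can check the counters directly on a node. A quick sketch, assuming the primary interface is eth0 (on some AMIs it is ens5):

# Run on the node (for example via SSM); the interface name is an assumption
ethtool -S eth0 | grep allowance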

To collect these metrics, the ethtool collector must be enabled in node exporter.


Enable ethtool Collector in node exporter

Add the following arguments to the node exporter container.

containers:
- args:
  - --collector.ethtool
  - --collector.ethtool.device-include=(eth|em|eno|ens|enp)[0-9s]+
  - --collector.ethtool.metrics-include=.*

After applying this change, the metrics will become available in Prometheus and Grafana.
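
If node exporter is deployed through the kube-prometheus-stack Helm chart, the same flags can usually be passed through the prometheus-node-exporter subchart; a sketch, assuming that chart layout:

# values.yaml (kube-prometheus-stack) – assumed chart structure
prometheus-node-exporter:
  extraArgs:
    - --collector.ethtool
    - --collector.ethtool.device-include=(eth|em|eno|ens|enp)[0-9s]+
    - --collector.ethtool.metrics-include=.*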

Building the Grafana Dashboard

All panels are time series panels, broken down per node, to help correlate network saturation with DNS errors or latency.

Available Connection Tracking Capacity

The metric exported by node exporter is:

node_ethtool_conntrack_allowance_available

It represents the current number of connections that can still be tracked on each node.
PromQL query:

node_ethtool_conntrack_allowance_available{job="node-exporter"}

(Panel screenshot: AWS instance-level available connections)
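
To get warned before a node runs out of trackable connections, a Prometheus alerting rule can watch this gauge. A sketch, where the 10000-connections threshold is an arbitrary assumption:

groups:
  - name: ec2-network-allowances
    rules:
      - alert: ConntrackAllowanceLow
        # Remaining trackable connections on the instance are getting low (threshold is an assumption)
        expr: node_ethtool_conntrack_allowance_available{job="node-exporter"} < 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Conntrack allowance low on {{ $labels.instance }}"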

Packets Dropped Due to Conntrack Exhaustion

The metric node_ethtool_conntrack_allowance_exceeded is a counter that increases over time.
To calculate packet drops per second, use the rate() function.

sum by (instance) (
  rate(
    node_ethtool_conntrack_allowance_exceeded{job="node-exporter"}[1m]
  )
)

The resulting panel looks like this:

(Panel screenshot: Packets Dropped Due to Conntrack Exhaustion)
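
A matching alert can fire on any sustained drop rate, since any non-zero value already indicates a problem. A sketch, added to the same rule group as above:

- alert: ConntrackAllowanceExceeded
  # Packets are being dropped because the connection-tracking limit was exceeded
  expr: |
    sum by (instance) (
      rate(node_ethtool_conntrack_allowance_exceeded{job="node-exporter"}[5m])
    ) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Conntrack limit exceeded, packets dropped on {{ $labels.instance }}"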


Other Network Allowance Exceeded Metrics

Add the following panels using the same counter-to-rate approach.

  • node_ethtool_bw_in_allowance_exceeded
  • node_ethtool_bw_out_allowance_exceeded
  • node_ethtool_pps_allowance_exceeded
  • node_ethtool_linklocal_allowance_exceeded

Each panel shows packets dropped per second per node.
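
For example, the link-local panel, which covers the DNS, IMDS, and Time Sync traffic discussed earlier, uses the same pattern:

sum by (instance) (
  rate(
    node_ethtool_linklocal_allowance_exceeded{job="node-exporter"}[1m]
  )
)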

Full Grafana dashboard JSON:

Network limits dashboard


Final Insight

All *_allowance_exceeded metrics are tied to EC2 instance sizing, with one exception.

Link-local traffic has a fixed limit of 1024 packets per second, regardless of instance size.

This explains why DNS can fail even when CPU, memory, and pod-level metrics look healthy.

The bottleneck exists below Kubernetes, at the EC2 networking layer.


Takeaway

If you are debugging intermittent DNS failures on EKS, do not stop at CoreDNS metrics.

Always inspect instance-level network allowances.
