<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dat Ton</title>
    <description>The latest articles on DEV Community by Dat Ton (@datton94).</description>
    <link>https://dev.to/datton94</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3692342%2F8246bc25-b62a-441b-a7dd-6cc08a54200c.jpg</url>
      <title>DEV Community: Dat Ton</title>
      <link>https://dev.to/datton94</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datton94"/>
    <language>en</language>
    <item>
      <title>How My Client Hit Linux Kernel Network Limits on AWS EKS</title>
      <dc:creator>Dat Ton</dc:creator>
      <pubDate>Sun, 04 Jan 2026 13:09:41 +0000</pubDate>
      <link>https://dev.to/datton94/how-my-client-hit-linux-kernel-network-limits-on-aws-eks-3am5</link>
      <guid>https://dev.to/datton94/how-my-client-hit-linux-kernel-network-limits-on-aws-eks-3am5</guid>
      <description>&lt;p&gt;&lt;em&gt;Hi everyone! 👋 This is a post from my personal notebook. I originally published it on &lt;a href="https://datton94.github.io/My-client-hit-linux-kernel-out-of-quota/" rel="noopener noreferrer"&gt;My Blog&lt;/a&gt; where I document my journey as a DevOps Engineer. I hope you find it useful!&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The issue
&lt;/h1&gt;

&lt;p&gt;This is a story about a tricky issue I resolved recently.&lt;/p&gt;

&lt;p&gt;My client hosts their system on AWS EKS, and I manage their Kubernetes platform. One day, they sent me a ticket saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We have recently noticed a lot of curl calls failing from our service that runs in the night time. Application tries to make curl call to application in another namespace via the service&lt;/p&gt;

&lt;p&gt;curl -v &lt;a href="http://service-b-live.namespace-b:8080/api/mobile.html" rel="noopener noreferrer"&gt;http://service-b-live.namespace-b:8080/api/mobile.html&lt;/a&gt;&lt;br&gt;
But we are receiving intermittent curl failures.&lt;br&gt;
This has started only recently from 7th Aug.&lt;/p&gt;

&lt;p&gt;Can we get someone to check if any DNS issues or issues with Load balancer happened around that time. &lt;br&gt;
It has been happening every night since 7th Aug 2025&lt;/p&gt;

&lt;p&gt;Considering it is a critical workflow for the application and we do not want any interruption in that, can these be looked on priority?&lt;/p&gt;

&lt;p&gt;Thank You&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here is the cURL output they provided:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;cURL error 65: The cURL request was retried 3 times and did not succeed. The most likely reason for the failure is that the cURL unable to rewind the body of the request and subsquent retries resulted in the same error. Turn on the debug option to see what went wrong.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  The investigation
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Start with the Log
&lt;/h2&gt;

&lt;p&gt;According to the cURL documentation (&lt;a href="https://curl.se/libcurl/c/libcurl-errors.html" rel="noopener noreferrer"&gt;https://curl.se/libcurl/c/libcurl-errors.html&lt;/a&gt;), error code 65 means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;CURLE_SEND_FAIL_REWIND (65)&lt;/p&gt;

&lt;p&gt;When doing a send operation curl had to rewind the data to retransmit, but the rewinding operation failed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From my understanding, &lt;code&gt;cURL 65&lt;/code&gt; is usually a side effect of something worse. Imagine the application is sending data, and suddenly something goes wrong with the network. The data stream is interrupted, and cURL tries to "rewind" to send it again, but fails.&lt;/p&gt;

&lt;p&gt;I checked the ELK logs and found around 15k of these &lt;code&gt;cURL 65&lt;/code&gt; events. That is too many. This suggested a serious network issue, even though the connection was just Pod-to-Pod inside the cluster.&lt;/p&gt;

&lt;p&gt;The client's application technically needs to run thousands of cURL commands because that is their business logic. Since there were so many requests, it was tricky to find the original error that triggered the rewind failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57zrnwavn791d0eo8xdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57zrnwavn791d0eo8xdy.png" alt="cURL_logs" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Network Bandwidth
&lt;/h2&gt;

&lt;p&gt;Next, I looked at the network metrics in ELK (collected by metricbeat).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfc5txu8hj3ue97o8bcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfc5txu8hj3ue97o8bcu.png" alt="network_usage" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I could see their application consumed a lot of bandwidth, around &lt;code&gt;200 MB&lt;/code&gt; to &lt;code&gt;230 MB&lt;/code&gt; per 60 seconds.&lt;/p&gt;

&lt;p&gt;I wondered: Did the network bandwidth of the EC2 instance (EKS worker nodes) exceed the limit?&lt;/p&gt;

&lt;p&gt;I am using &lt;code&gt;t3a.2xlarge&lt;/code&gt; instances as worker nodes. This type provides network bandwidth up to 5 Gbps, which is roughly &lt;code&gt;600 MB&lt;/code&gt; per second. So, &lt;code&gt;200 MB&lt;/code&gt; per 60 seconds is extremely small compared to the limit. Bandwidth was not the problem.&lt;/p&gt;
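&lt;p&gt;A quick sanity check of that arithmetic (integer math, so the numbers are approximate):&lt;/p&gt;

```shell
# 5 Gbps expressed in MB/s: 5000 Mbit/s divided by 8 bits per byte.
echo $((5000 / 8))        # 625 MB/s
# Capacity of a 60-second window at that rate:
echo $((5000 / 8 * 60))   # 37500 MB, versus the ~230 MB we observed
```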

&lt;p&gt;I tried to find if any other issues occurred at the same time. You know, network issues usually cause a chain reaction of other errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DNS Resolution
&lt;/h2&gt;

&lt;p&gt;I started diving deeper into ELK and found these logs from &lt;code&gt;6th Aug 19:00 UTC&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrhbmdkb8aost6hnael7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrhbmdkb8aost6hnael7.png" alt="php_dns_resolution_issue" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo for sys.db.REDACTED.REDACTED.internal failed: Try again&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It looked like the PHP code failed to resolve DNS. This suggested something might be wrong with the DNS service.&lt;/p&gt;

&lt;p&gt;In Kubernetes, when a pod asks for DNS resolution, it sends the request to &lt;code&gt;coreDNS&lt;/code&gt;. So I checked the &lt;code&gt;coreDNS&lt;/code&gt; logs. I didn't see any obvious errors. The logs mostly looked like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[INFO]10.234.170.252:55413 - 51380 "A IN sys.db.REDACTED.REDACTED.internal.cluster.local. udp 61 false 512" NXDOMAIN qr,aa,rd 154 0.0000050782s&lt;/p&gt;

&lt;p&gt;[INFO]10.234.170.252:55413 - 51380 "AAAA IN sys.db.REDACTED.REDACTED.internal.cluster.local. udp 61 false 512" NXDOMAIN qr,aa,rd 154 0.0000089841s&lt;/p&gt;

&lt;p&gt;[INFO]10.234.170.252:40561 - 17825 "AAAA IN sys.db.REDACTED.REDACTED.internal. udp 61 false 512" NOERROR qr,aa,rd 154 0.0000045891s&lt;/p&gt;

&lt;p&gt;[INFO]10.234.170.252:40561 - 17429 "AAAA IN sys.db.REDACTED.REDACTED.internal. udp 61 false 512" NOERROR qr,aa,rd 154 0.0000068391s&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is actually normal behavior.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The pod's resolver first tries the name with the &lt;code&gt;cluster.local&lt;/code&gt; search suffix appended. That record didn't exist, so &lt;code&gt;coreDNS&lt;/code&gt; returned &lt;code&gt;NXDOMAIN&lt;/code&gt; (Non-Existent Domain).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It then retried the original domain name without the suffix. &lt;code&gt;coreDNS&lt;/code&gt; found the record and returned &lt;code&gt;NOERROR&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;My client saw &lt;code&gt;NXDOMAIN&lt;/code&gt; and thought it was the root cause. They didn't know about this specific behavior in Kubernetes, so I had to explain it to them. I'm writing it here to remind myself too!&lt;/p&gt;
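&lt;p&gt;This search-suffix behavior comes from the pod's &lt;code&gt;/etc/resolv.conf&lt;/code&gt;, which kubelet generates. You can see it directly (the pod name here is a placeholder):&lt;/p&gt;

```shell
# Dump the resolver config of any pod in the cluster.
kubectl exec some-pod -- cat /etc/resolv.conf
# Typical output on EKS (values illustrative):
#   search namespace-a.svc.cluster.local svc.cluster.local cluster.local
#   nameserver 10.100.0.10
#   options ndots:5
```

&lt;p&gt;With &lt;code&gt;ndots:5&lt;/code&gt;, any name containing fewer than five dots is tried against each search suffix before being queried as-is, which is exactly why the &lt;code&gt;NXDOMAIN&lt;/code&gt; answers show up before the final &lt;code&gt;NOERROR&lt;/code&gt;.&lt;/p&gt;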

&lt;p&gt;&lt;code&gt;coreDNS&lt;/code&gt; exposes metrics via this line in its Corefile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prometheus 0.0.0.0:9153
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But these metrics can only be scraped by Prometheus, which we did not have installed yet.&lt;/p&gt;

&lt;p&gt;Also, AWS EC2 instances have limits not just on bandwidth, but also on Connections and Packets Per Second (PPS).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metricbeat struggles to collect these specific EC2 network metrics. So, I decided it was time to implement Prometheus with &lt;code&gt;node-exporter&lt;/code&gt;.&lt;/p&gt;
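&lt;p&gt;As a side note, for a one-off check you don't even need Prometheus: on the node itself the ENA driver exposes these counters through &lt;code&gt;ethtool&lt;/code&gt;, as the AWS doc above describes (the interface name &lt;code&gt;eth0&lt;/code&gt; is an assumption):&lt;/p&gt;

```shell
# Dump ENA NIC statistics and keep only the allowance counters.
ethtool -S eth0 | grep -E 'allowance_(exceeded|available)'
```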

&lt;p&gt;Off-topic note: Why didn't I have Prometheus from the start? My client thought Metricbeat + ELK was enough. But in my opinion, Prometheus is much better for the Kubernetes world. Now was the perfect time to prove it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install Prometheus Stack
&lt;/h2&gt;

&lt;p&gt;I used the &lt;code&gt;kube-prometheus-stack&lt;/code&gt; Helm chart and managed the deployment via &lt;code&gt;ArgoCD&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I customized &lt;code&gt;values.yaml&lt;/code&gt; to enable &lt;code&gt;kubelet&lt;/code&gt;, &lt;code&gt;coreDNS&lt;/code&gt;, and &lt;code&gt;nodeExporter&lt;/code&gt; metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kubelet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;serviceMonitor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;coreDns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# collect the coreDNS metrics&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;serviceMonitor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;nodeExporter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;prometheus-node-exporter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;extraArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# These configurations below are important, they instruc the node export to collect metrics for PPS and Connections&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+)($|/)&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--collector.ethtool&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--collector.ethtool.metrics-include=(bw_.*|pps_allowance_exceeded|linklocal_allowance_exceeded|conntrack_.*)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Investigate with Prometheus metrics
&lt;/h2&gt;

&lt;p&gt;Now I had the data I needed.&lt;/p&gt;

&lt;p&gt;First, I checked &lt;code&gt;coreDNS&lt;/code&gt; again. The PHP DNS issue happened at &lt;code&gt;6th Aug 19:00 UTC&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc65hid2epybm7c8to8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmc65hid2epybm7c8to8l.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I queried for all return codes (rcode) other than NOERROR. I only saw &lt;code&gt;NXDOMAIN&lt;/code&gt;. There was no &lt;code&gt;SERVFAIL&lt;/code&gt; or &lt;code&gt;REFUSED&lt;/code&gt;. This confirmed coreDNS was healthy.&lt;/p&gt;
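&lt;p&gt;For reference, the query behind that panel was along these lines (the metric name comes from coreDNS's &lt;code&gt;prometheus&lt;/code&gt; plugin; exact labels may differ by version):&lt;/p&gt;

```promql
# DNS responses per second, grouped by return code, excluding NOERROR.
sum by (rcode) (rate(coredns_dns_responses_total{rcode!="NOERROR"}[5m]))
```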

&lt;p&gt;So, &lt;code&gt;coreDNS&lt;/code&gt; was fine, but the PHP app still failed to resolve DNS. This suggested something was blocking the traffic from the App to &lt;code&gt;coreDNS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Time to check the AWS Network Interface metrics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Supported on&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bw_in_allowance_exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The number of packets queued or dropped because the inbound aggregate bandwidth exceeded the maximum for the instance.&lt;/td&gt;
&lt;td&gt;All instance types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bw_out_allowance_exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The number of packets queued or dropped because the outbound aggregate bandwidth exceeded the maximum for the instance.&lt;/td&gt;
&lt;td&gt;All instance types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;conntrack_allowance_exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The number of packets dropped because connection tracking exceeded the maximum for the instance and new connections could not be established. This can result in packet loss for traffic to or from the instance.&lt;/td&gt;
&lt;td&gt;All instance types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;conntrack_allowance_available&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The number of tracked connections that can be established by the instance before hitting the Connections Tracked allowance of that instance type.&lt;/td&gt;
&lt;td&gt;Nitro-based instances only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;linklocal_allowance_exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The number of packets dropped because the PPS of the traffic to local proxy services exceeded the maximum for the network interface. This impacts traffic to the Amazon DNS service, the Instance Metadata Service, and the Amazon Time Sync Service, but does not impact traffic to custom DNS resolvers.&lt;/td&gt;
&lt;td&gt;All instance types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pps_allowance_exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The number of packets queued or dropped because the bidirectional PPS exceeded the maximum for the instance.&lt;/td&gt;
&lt;td&gt;All instance types&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After checking, I was surprised to find that only &lt;code&gt;pps_allowance_exceeded&lt;/code&gt; had any data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2edctv4bml50jgnzwv3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2edctv4bml50jgnzwv3m.png" alt="pps_metrics" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;
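&lt;p&gt;node-exporter's &lt;code&gt;ethtool&lt;/code&gt; collector publishes each NIC statistic with a &lt;code&gt;node_ethtool_&lt;/code&gt; prefix, so the panel above is roughly this query (metric name assumed from that naming convention):&lt;/p&gt;

```promql
# Packets dropped per second because the instance PPS allowance was hit.
rate(node_ethtool_pps_allowance_exceeded[5m])
```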

&lt;p&gt;Although the metric showed some dropped packets, there were not many; they could not account for 15k &lt;code&gt;cURL 65&lt;/code&gt; errors.&lt;br&gt;
All the other metrics looked good; nothing was dropped.&lt;/p&gt;

&lt;p&gt;Once again, I had hit a dead end.&lt;/p&gt;

&lt;p&gt;If the network limits were not the root cause, what could be?&lt;/p&gt;
&lt;h2&gt;
  
  
  The CPU
&lt;/h2&gt;

&lt;p&gt;Could the worker node be overloaded? If the CPU is overloaded, it increases latency. Processes have to wait longer for CPU time, which leads to network timeouts.&lt;/p&gt;

&lt;p&gt;I checked the CPU Utilization of the worker nodes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr1k94llyio5camkoc85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flr1k94llyio5camkoc85.png" alt="cpu_utilization" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was only around 60%. Not full. I also checked &lt;code&gt;Load Average&lt;/code&gt;, and it looked fine (I forgot to take a screenshot of that, sorry!).&lt;/p&gt;

&lt;p&gt;But while looking at CPU metrics, I noticed this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nsxfz6yjlcrwj65qkli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nsxfz6yjlcrwj65qkli.png" alt="cpu_squeeze" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;CPU Softnet Times Squeezed&lt;/code&gt; was high, around 60 to 150 per second.&lt;/p&gt;

&lt;p&gt;What does &lt;code&gt;CPU Softnet Times Squeezed&lt;/code&gt; mean?&lt;br&gt;
&lt;a href="https://www.netdata.cloud/blog/understanding-interrupts-softirqs-and-softnet-in-linux/" rel="noopener noreferrer"&gt;https://www.netdata.cloud/blog/understanding-interrupts-softirqs-and-softnet-in-linux/&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Squeezed:&lt;/strong&gt; This dimension shows the number of times the network device budget was consumed or the time limit was reached, but more work was available. The network device budget is a resource that is allocated to the softnet code to process incoming packets. When the budget is consumed or the time limit is reached, the softnet code may not be able to process all of the available packets. In this case, the softnet code will “squeeze” the remaining packets into the next budget or time slice. If you are seeing a high number of squeezed packets, it may indicate that your network interface is not keeping up with the workload and needs to be optimized.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This made sense! As I mentioned, the application was sending thousands of requests in a very short time (likely more than 20k requests).&lt;/p&gt;

&lt;p&gt;The Linux kernel has a specific limit on how many packets it will process in a single "poll cycle". Here are the common default values in most Linux distros:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;net.core.netdev_budget &lt;span class="o"&gt;=&lt;/span&gt; 300 &lt;span class="c"&gt;# Max packets processed in one poll cycle&lt;/span&gt;
net.core.netdev_budget_usecs &lt;span class="o"&gt;=&lt;/span&gt; 2000 &lt;span class="c"&gt;# Time budget to handle the packets, default 2 milliseconds&lt;/span&gt;
net.core.netdev_max_backlog &lt;span class="o"&gt;=&lt;/span&gt; 1000 &lt;span class="c"&gt;# Max packets queued if the kernel can't keep up.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the kernel hits the &lt;code&gt;netdev_budget&lt;/code&gt; limit before clearing the queue, it stops processing and increments the "squeezed" counter. The remaining packets have to wait, causing delays and timeouts.&lt;/p&gt;
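&lt;p&gt;You can read this counter straight from the kernel: each row of &lt;code&gt;/proc/net/softnet_stat&lt;/code&gt; is one CPU, and the third hex column is the squeeze count (a sketch; the column layout has been stable across recent kernels but is version-dependent):&lt;/p&gt;

```shell
# Print the per-CPU squeeze counter: the number of times net_rx_action
# ran out of budget while packets were still waiting.
cat /proc/net/softnet_stat | while read -r _ _ squeezed _; do
  printf 'squeezed=%d\n' "0x$squeezed"
done
```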

&lt;p&gt;This was the root cause.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Solution
&lt;/h1&gt;

&lt;p&gt;I needed to increase these limits. I created a new file in &lt;code&gt;/etc/sysctl.d/&lt;/code&gt; on the worker nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/sysctl.d/99-network-tuning.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And added these configurations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;net.core.netdev_budget &lt;span class="o"&gt;=&lt;/span&gt; 600
net.core.netdev_budget_usecs &lt;span class="o"&gt;=&lt;/span&gt; 4000
net.core.netdev_max_backlog &lt;span class="o"&gt;=&lt;/span&gt; 2000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we are on AWS EKS, I couldn't just SSH in and change it manually (because nodes are ephemeral). I updated the Launch Template for the Worker Node Auto Scaling Group. I added a small script in the &lt;code&gt;user-data&lt;/code&gt; section to apply these kernel settings on boot.&lt;/p&gt;
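&lt;p&gt;A minimal sketch of that &lt;code&gt;user-data&lt;/code&gt; addition (file name and values are the ones from above; treat this as illustrative rather than the exact script we shipped):&lt;/p&gt;

```shell
#!/bin/bash
# Persist the tuning so every node launched by the ASG boots with it.
{
  echo 'net.core.netdev_budget = 600'
  echo 'net.core.netdev_budget_usecs = 4000'
  echo 'net.core.netdev_max_backlog = 2000'
} > /etc/sysctl.d/99-network-tuning.conf

# Apply immediately without waiting for a reboot.
sysctl --system
```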

&lt;p&gt;After the update, the squeezed metric dropped, and the cURL errors disappeared. The issue was finally resolved.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;I am Dat, a DevOps Engineer based in Vietnam. I love solving infrastructure mysteries and building better systems.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read more&lt;/strong&gt;: If you like this post, check out my &lt;a href="https://datton94.github.io/" rel="noopener noreferrer"&gt;Personal Blog&lt;/a&gt; for more DevOps notes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Networking&lt;/strong&gt;: I’m always open to chatting about Kubernetes, Cloud, or Platform Engineering. Feel free to say hi on &lt;a href="https://www.linkedin.com/in/dat-ton-that-thanh-928704111/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(If you spot any mistakes in my post, please leave a comment. I am here to learn!)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>linux</category>
      <category>devops</category>
      <category>aws</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
