keeper

Posted on May 29

The Router Couldn't See My NAS — A 3-Hour Debug Into a Silent Intel NIC Bug

#hardware #linux #networking #devops

My TRIM NAS at 192.168.3.135 had been running Hermes Agent 24/7 for months, handling the Telegram gateway, proxy routing (mihomo), cron jobs, and file serving. It's a solid Debian 12 box with a custom 6.12.18 kernel.

Then one evening I noticed I couldn't reach Telegram. Then my Windows machine (T2) lost internet. Then even the router's device list showed the NAS as offline.

Ping to the router gateway (192.168.3.1) worked fine from the NAS itself. But the router couldn't see it.

This is the story of what I found — and how the real culprit was a feature meant to save electricity.

Phase 1: The Obvious Suspect — Mihomo

The first thing I checked was mihomo, my proxy daemon. It's the gateway between my LAN and the outside world. If it crashes, everything behind it loses connectivity — Telegram, web browsing, API calls — the whole stack.

May 29 23:52:17 trim-0c5b mihomo[692755]: level=fatal msg="Parse config error:
  rules[14] [DOMAIN-SUFFIX,api.telegram.org,📱DEFAULT] error:
  proxy [📱DEFAULT] not found"

Clear enough. The config had rules referencing proxy groups (📱DEFAULT and 📱Telegram) that no longer existed — likely from a subscription update that replaced the proxy-groups section but left the old rules untouched. Mihomo refused to start. Fix was straightforward: point those rules to an existing group (🚀节点选择).

But the mihomo failure was only a consequence: the broken config surfaced when mihomo restarted after I reset the NAS. So why did I have to reset the NAS in the first place?

Phase 2: Looking Deeper — System Logs Tell a Story

The reset happened at 22:47. Working backwards through /var/log/syslog.1, I found the real timeline:

21:00 — node[368932]: [fetch-timeout] fetch timeout after 10000ms
         url=https://api.telegram.org/bot***/getMe
21:01 — same timeout
21:02 — same timeout
...repeats every 60 seconds until 22:47...

This was the Hermes Telegram gateway — a Node.js process — unable to reach Telegram's API. Each request timed out after 10 seconds. Meanwhile:

21:09 — mihomo error: 🇺🇸美国圣何塞06 failed health check: context deadline exceeded
21:19 — 🇯🇵日本东京03, 🇯🇵日本东京04 also failing
21:24 — 🇹🇼台湾, 🇺🇸洛杉矶 also failing
...proxy node health checks kept failing in waves...

The proxy nodes were dropping one by one. Curl through the proxy to Google worked fine, but sustained connections kept timing out.
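To see how the gateway timeouts from the first excerpt were spread over the evening, the syslog can be bucketed by hour. This is a minimal sketch, not what I ran at the time: timeouts_per_hour is a hypothetical helper name, and it assumes traditional syslog timestamps where the third field is HH:MM:SS.

```shell
# Hypothetical helper: reads syslog text on stdin and prints "<hour> <count>"
# for every hour that contains [fetch-timeout] lines.
# Assumes traditional syslog timestamps ("May 29 21:00:03 host ..."),
# so the third whitespace-separated field is HH:MM:SS.
timeouts_per_hour() {
    awk '/\[fetch-timeout\]/ { split($3, t, ":"); n[t[1]]++ }
         END { for (h in n) print h, n[h] }' | sort
}
```

Usage would look like timeouts_per_hour < /var/log/syslog.1. A steady count per hour points at a persistent failure rather than a one-off blip.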

Key insight: The NAS was technically online (I could SSH in), but under high proxy load, the NIC was becoming invisible to the router. The router's ARP entry for the NAS expired, and the router couldn't re-resolve the NAS's MAC address.

No kernel panics. No OOM. No driver crash. No link-down events in dmesg, syslog, or kern.log. The NIC just... stopped responding at layer 2 under sustained load.
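One way to double-check the "no link-down events" claim is to count link-down messages in the kernel log. A minimal sketch: count_link_down is a hypothetical helper, and the pattern assumes the driver logs a line containing "Link is Down" when the link drops, which is how igb messages usually read.

```shell
# Hypothetical helper: counts kernel log lines on stdin that mention
# "link is down" (case-insensitive). Prints 0 when there are none.
count_link_down() {
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -ci 'link is down' || true
}
```

Typical use would be dmesg | count_link_down, or journalctl -k | count_link_down if the journal is persistent. A result of 0 while the router cannot reach the box is what "silent at layer 2" looks like. From a second Linux host on the LAN, arping -I <interface> 192.168.3.135 is a direct way to test whether the NAS still answers ARP.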

Phase 3: Hardware Fingerprint

03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection
Driver: igb (in-kernel, version 6.12.18-trim)
Firmware: 0.4-1

Intel I211. Not the notoriously buggy I225/I226 that plague many 2.5GbE boards, just a plain, reliable old 1GbE chip. The I211 has been shipping since 2012. It should be bulletproof.

But the boot params told a different story:

GRUB_CMDLINE_LINUX="modprobe.blacklist=pcspkr pcie_aspm=off"

pcie_aspm=off — someone had already encountered PCIe power management issues with this NIC before me.

Phase 4: The Real Culprit — EEE

I checked the NIC's Energy Efficient Ethernet settings:

EEE status: enabled - inactive
Supported EEE link modes: 100baseT/Full, 1000baseT/Full
Advertised EEE link modes: 100baseT/Full, 1000baseT/Full
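These lines are the kind of output ethtool --show-eee prints. For scripted checks, the status line can be pulled out on its own. A minimal sketch: eee_status and eee_is_disabled are hypothetical helper names, and the parsing assumes the "EEE status:" label shown above.

```shell
# Hypothetical helper: reads `ethtool --show-eee <iface>` output on stdin
# and prints only the value of the "EEE status:" line.
eee_status() {
    sed -n '/EEE status:/{s/^[[:space:]]*EEE status:[[:space:]]*//;s/[[:space:]]*$//;p;}'
}

# Hypothetical helper: succeeds only when the status reads "disabled".
eee_is_disabled() {
    [ "$(eee_status)" = "disabled" ]
}
```

For example, ethtool --show-eee enp3s0 | eee_status prints the current state, and ethtool --show-eee enp3s0 | eee_is_disabled can gate a monitoring check.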

EEE (IEEE 802.3az) lets the NIC drop into a low-power idle state between packets. On paper it saves a fraction of a watt. In practice, on the igb driver with an I211, the link sometimes fails to wake up cleanly from low-power idle under high connection churn.

This is exactly what was happening:

1. The Telegram gateway kept opening and abandoning connections (one timed-out retry every 60 seconds from 21:00 to 22:47 is roughly a hundred attempts), while mihomo's health checks added bursts of their own.
2. Between bursts the link goes idle and the PHY drops into low-power idle; every new burst has to wake it again.
3. At some point, the link doesn't come back cleanly.
4. The router sees the port as active (switch lights are on) but gets no response to ARP queries.
5. From the router's perspective, the NAS is gone.

EEE is a frequently reported cause of "link light is on but device is unreachable" symptoms on Intel NICs. Reports cover the I210, I211, I225, and I226 to varying degrees.

Phase 5: The Three Fixes

Fix 1 — Kill EEE (immediate)

ethtool --set-eee enp3s0 eee off

This disables EEE immediately without a reboot. The NIC stays in full-power mode and never enters low-power idle. Confirmed:

EEE status: disabled

Fix 2 — Enlarge the Ring Buffers

The default ring buffer on the igb driver is 256 descriptors for both RX and TX. Maximum is 4096. Under sustained proxy load with hundreds of concurrent connections, 256 is a bottleneck — the NIC runs out of buffer space and starts dropping packets.

ethtool -G enp3s0 rx 2048 tx 2048

This increases the buffer to 2048 descriptors each. The NIC now has 8× more room to queue packets before dropping them.
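Before and after resizing, ethtool -g reports both the hardware maximum and the current ring size. A minimal sketch for reading the two RX values in a script: ring_rx_current_and_max is a hypothetical helper, and it assumes the usual "Pre-set maximums:" and "Current hardware settings:" section headers in the output.

```shell
# Hypothetical helper: reads `ethtool -g <iface>` output on stdin and
# prints "<current RX> <maximum RX>".
ring_rx_current_and_max() {
    awk '
        /^Pre-set maximums:/            { section = "max" }
        /^Current hardware settings:/   { section = "cur" }
        # Match only the plain "RX:" row, not "RX Mini:" or "RX Jumbo:".
        $1 == "RX:" && section == "max" { max = $2 }
        $1 == "RX:" && section == "cur" { cur = $2 }
        END { print cur, max }
    '
}
```

Running ethtool -g enp3s0 | ring_rx_current_and_max should print 2048 4096 once the change is applied. To confirm that drops were actually occurring, ethtool -S enp3s0 | grep -iE 'drop|no_buffer|missed' is a reasonable place to look, though counter names vary by driver.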

Fix 3 — Make It Stick (persistence)

For the boot params:

# /etc/default/grub
GRUB_CMDLINE_LINUX="modprobe.blacklist=pcspkr pcie_aspm=off igb.eee=0"

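igb.eee=0 is meant to tell the igb kernel module never to enable EEE, regardless of what the link partner advertises. Be aware that module parameters differ between igb builds: Intel's out-of-tree driver documents an EEE option, while the in-kernel driver may not accept one, and a parameter the module does not declare is ignored. Run update-grub after editing the file; the ethtool call in the service below remains the setting to rely on.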
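Whether a boot-time module parameter takes effect depends on the module declaring it, and modinfo -p lists the parameters a given build accepts. A minimal sketch for checking that: module_has_param is a hypothetical helper, and it assumes the "name:description" line format that modinfo -p prints.

```shell
# Hypothetical helper: reads `modinfo -p <module>` output on stdin and
# succeeds only if a parameter named exactly $1 is listed.
module_has_param() {
    awk -F: -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}
```

For example, modinfo -p igb | module_has_param eee || echo "this igb build lists no eee parameter". After update-grub and a reboot, cat /proc/cmdline shows whether the parameter reached the kernel command line at all.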

For the ring buffer and EEE state, I created a systemd oneshot service:

[Unit]
Description=NIC tuning - Intel I211 fixes
After=network.target
Before=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool --set-eee enp3s0 eee off
ExecStart=/usr/sbin/ethtool -G enp3s0 rx 2048 tx 2048
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

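Saved under /etc/systemd/system/ as hermes-nic.service and enabled with systemctl enable, this runs before the network is declared online, so every service ordered after network-online.target sees the tuned NIC.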

What I Learned

The hardest bugs leave no logs. The NIC didn't crash, didn't report errors, didn't trigger a kernel oops. It just silently stopped responding to ARP. If I hadn't checked the EEE status, I'd still be blaming mihomo or the router.

EEE is a false economy. The power savings on a 1GbE desktop NIC are negligible — maybe 0.3-0.5 watts. The stability cost far exceeds the benefit. For any always-on server, NAS, or gateway running the igb driver: turn EEE off.

Ring buffer defaults are tuned for desktops, not servers. 256 descriptors is fine for a web browser but chokes under proxy load. Raising it to 2048 costs only some extra buffer memory and eliminates an entire class of packet-drop edge cases.

The mihomo config bug was a distraction. It was the symptom that I noticed, but the real problem was at layer 2. If I'd just fixed mihomo and moved on, the EEE drop would have come back within days.

Current State

EEE status:          disabled
Ring buffer RX/TX:   2048 / 2048
Boot param:          pcie_aspm=off igb.eee=0
Systemd service:     hermes-nic.service (enabled)

Router has been seeing the NAS continuously for 24+ hours since the fixes. Telegram gateway stable. Proxy health checks clean.
