John

Posted on Jun 26 • Edited on Jul 11 • Originally published at hexisteme.github.io

macOS: nslookup works but curl and Python "Could not resolve host" — the mDNSResponder zombie

#cli #networking #python #tutorial

Originally published on hexisteme notes.

If nslookup resolves a host fine but curl, pip, and Python (requests/httpx) fail with "Could not resolve host," your mDNSResponder daemon has almost certainly entered a non-responsive "zombie" state — and the fix is to restart it (sudo killall -9 mDNSResponder), not to touch your API keys, SDK versions, or code.

This failure is maddening because every signal points the wrong way. nslookup example.com returns a clean IP, so DNS "works." Your network is up. Your code didn't change. Yet curl, pip install, and every Python HTTP call die with "Could not resolve host." People burn hours rotating API keys, downgrading SDKs, and editing config files. None of that is the problem.

The two-DNS-path asymmetry (your fastest diagnosis)

macOS resolves DNS over two independent paths, and the asymmetry is the diagnosis:

Path	Who uses it	Goes through mDNSResponder?
Direct	`nslookup`, `dig`	No — queries DNS servers directly
System resolver	`curl`, Python, `pip`, most apps (`getaddrinfo()`)	Yes — routes to the daemon

The daemon can keep its PID alive while silently refusing to answer. So the direct path (nslookup) succeeds and the system-resolver path (curl) fails on the exact same host. When one path works and the other fails, you are not looking at a network, key, or code problem — you are looking at a sick daemon.

Confirm it in 10 seconds

Use the daemon path directly so you're testing the same route curl uses:

# Uses the mDNSResponder path (same as curl/Python). Empty result = zombie.
dscacheutil -q host -a name example.com

# ...while the direct path still returns a valid IP:
nslookup example.com

If dscacheutil comes back empty but nslookup returns an IP, the daemon is confirmed. One more check proves routing and TLS are fine and isolates the daemon as the sole cause:

curl --resolve example.com:443:<IP-from-nslookup> https://example.com
# Succeeds? Then DNS resolution is the only broken thing.

Fix it weakest-tool-first

# 1) flush + reload (least disruptive)
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder

# 2) kickstart the service
sudo launchctl kickstart -k system/com.apple.mDNSResponder

# 3) universal hammer (if kickstart prints "Could not find service" —
#    the service path differs across macOS versions)
sudo killall -9 mDNSResponder mDNSResponderHelper
#    launchd's KeepAlive immediately restarts it with a fresh PID.

# verify
pgrep -l mDNSResponder && dscacheutil -q host -a name example.com

Why agents and long-lived dev boxes hit this repeatedly

The zombie is usually triggered by connection-pool exhaustion. Many concurrent outbound long-poll connections — several MCP servers, a local inference/LLM server, and a background job all holding sockets open at once — push mDNSResponder into high CPU until it stops answering.

Measured case: 30+ concurrent long-poll connections drove mDNSResponder to 77% CPU and into the non-responsive state. The recovery alone doesn't stop recurrence — you have to find and cut the connection source. Watch for the active connection count climbing (lsof -i -P -n | wc -l in the 80+ range) and mDNSResponder CPU above ~30%.

"Disabled" in config does not mean the process is dead

A subtle trap when hunting the connection source: a server can be flagged off at the code/config level (an ENABLED=False switch) while its OS process keeps running and keeps holding its long-poll connections. The flag stopped new work from being dispatched but never killed the process, so it kept feeding the pile-up. When auditing, check ps/pgrep for the actual process and its elapsed time — not just the config flag.

pgrep -lf <server-name>        # still there?
ps -o pid,etime,%cpu,command -p <pid>

Prevent recurrence — and the honest limitation

Two durable fixes: reduce concurrent long-poll load (trim always-on servers/MCP endpoints, kill stale inference servers, and on a box that runs for many days, restart mDNSResponder on a schedule), or install a watchdog LaunchDaemon that kickstarts the daemon when its CPU crosses a threshold.

Honest caveat: this exact two-path asymmetry is macOS-specific — it's mDNSResponder/getaddrinfo behavior, not Linux's nsswitch/resolv.conf. And a watchdog that calls launchctl kickstart needs root: implement it as a LaunchDaemon running as root with a narrowly scoped script, not a broad sudo NOPASSWD rule. The watchdog is a band-aid — the real fix is capping concurrent connections.

More notes on building & running an AI agent fleet at hexisteme.github.io/notes.