Debugging Occasional ECONNRESET Errors in Node.js: Root Causes and Fixes

It passes every test on your laptop. It passes in staging. Then production logs a handful of Error: read ECONNRESET lines a day — never the same endpoint twice, never reproducible on demand. You add a retry, the count drops, and you move on without knowing what happened.

That is the worst outcome, because the error is usually telling you something specific about how your connections are managed. Here is what ECONNRESET means, why it clusters around idle connections, and the changes that actually stop it.

What ECONNRESET actually means

ECONNRESET means the peer sent a TCP RST packet. The connection was open and working, and then the other side discarded its half and told your kernel to stop using it. Node surfaces this as an error with code: 'ECONNRESET' and, often, syscall: 'read' — you were waiting to read a response and the socket died underneath you.

It is not the same as the two errors people confuse it with:

Error	What happened
`ECONNREFUSED`	Nothing accepted the connection — wrong port, or the service is down
`ETIMEDOUT`	The connection or response never completed within the allowed time
`ECONNRESET`	The connection was established, then the peer abruptly killed it

The word that matters is abruptly. Something on the other end was fine a moment ago and then was not. That narrows the search: you are not looking for a server that is down, you are looking for a connection that got closed while you still thought you owned it.
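As a rough sketch of how that distinction can be applied in logs or alerts, the helper below maps an error to one of the three cases in the table. The name classifyNetworkError is hypothetical, and the lookup through err.cause assumes the error came from fetch, which wraps the underlying socket error.

```javascript
// Hypothetical helper: maps a Node network error to a short diagnosis.
function classifyNetworkError(err) {
  // fetch() reports a TypeError and puts the socket error on err.cause,
  // so check both places for the error code.
  const code = err?.code ?? err?.cause?.code;
  switch (code) {
    case 'ECONNREFUSED':
      return 'refused: nothing accepted the connection';
    case 'ETIMEDOUT':
      return 'timeout: the connection or response never completed';
    case 'ECONNRESET':
      return 'reset: an established connection was killed by the peer';
    default:
      return 'other';
  }
}

// Example: the shape Node's http client typically reports for a reset.
const sample = Object.assign(new Error('read ECONNRESET'), {
  code: 'ECONNRESET',
  syscall: 'read',
});
console.log(classifyNetworkError(sample));
```

Tagging errors this way keeps resets from being lumped in with outages, which need a different investigation.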

Why it is almost always an idle connection

Modern Node HTTP clients reuse connections. Node's http.Agent with keepAlive: true (the global agent's default since Node 19) and undici, which backs globalThis.fetch in Node 18+, both keep TCP sockets in a pool after each response so the next request skips the TCP and TLS handshakes. That is good for latency. It is also the most common source of intermittent resets.

A pooled socket can be closed by the other side while it sits idle in your pool. You will not find out until you write the next request onto it. Your client believes the socket is alive; the server or load balancer already sent a FIN or RST; you push bytes into a dead connection and the reset comes back as ECONNRESET.

Almost everything between you and the origin closes idle connections on a timer:

AWS Application Load Balancers default to a 60-second idle timeout.
nginx defaults keepalive_timeout to 75 seconds.
A Node origin server's server.keepAliveTimeout defaults to 5000 ms.

That gives you a rule that prevents the race: the side that sends requests should give up an idle socket before the side that receives them. When your client holds idle sockets longer than the load balancer does, the load balancer wins the race every time and you eat resets.

The same root cause bites Node servers from the other direction. Node's 5-second keepAliveTimeout is shorter than an ALB's 60-second idle timeout, so the ALB keeps a socket it believes is reusable, sends a request into a connection Node already closed, and returns a 502 to your user. Same race, opposite roles.
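The ordering rule can be written down as a small check, for instance in a startup assertion or a config test. This is only a sketch: checkIdleTimeouts and its parameter names are made up for illustration, and the values passed in are whatever idle timeouts you have configured at each hop.

```javascript
// Hypothetical helper: returns a list of ordering problems (empty means OK).
// Rule: the sender of requests must give up an idle socket before the receiver.
function checkIdleTimeouts({ clientMs, loadBalancerMs, originMs }) {
  const problems = [];
  if (clientMs >= loadBalancerMs) {
    problems.push('client outlives load balancer: expect ECONNRESET on the client');
  }
  if (loadBalancerMs >= originMs) {
    problems.push('load balancer outlives origin: expect 502s from the load balancer');
  }
  return problems;
}

// A 30 s client, an ALB at its 60 s default, and Node's 5 s server default.
console.log(
  checkIdleTimeouts({ clientMs: 30_000, loadBalancerMs: 60_000, originMs: 5_000 }),
);
```

With those defaults the check reports only the second problem, the one on the origin side.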

HTTP keep-alive and TCP keepalive are different things, and confusing them wastes hours. Reaching for socket.setKeepAlive(true) will not fix this. TCP keepalive is an OS-level probe mechanism, and on Linux net.ipv4.tcp_keepalive_time defaults to 7200 seconds — two hours — before the first probe goes out. It will never detect a load balancer that closed your socket 60 seconds ago.
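To make the distinction concrete, here are the two knobs side by side. This is an illustrative sketch, not a fix: the 30-second delay is an arbitrary value chosen for the example.

```javascript
const http = require('node:http');
const net = require('node:net');

// HTTP keep-alive: an application-level decision to reuse one TCP connection
// for several requests. It is configured on the agent, i.e. the socket pool.
const agent = new http.Agent({ keepAlive: true });

// TCP keepalive: kernel-level probes sent on an otherwise silent socket.
// It is configured per socket; the second argument is the idle delay in
// milliseconds before the first probe (30 s here is an arbitrary example).
const socket = new net.Socket();
socket.setKeepAlive(true, 30_000);
```

The first setting decides whether sockets are pooled at all. The second only changes when the kernel starts probing a quiet connection.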

Two more causes worth ruling in or out: a burst of ECONNRESET during a rolling deploy is expected, because in-flight connections to terminating instances get reset — if the spike lines up with a deploy timestamp, that is the explanation. And an upstream that was OOM-killed or crashed will reset every connection it held.

Fixes that hold up in production

Align your timeouts. The ordering you want is: client idle timeout is shorter than the load balancer idle timeout, which is shorter than the origin server idle timeout. For a Node server sitting behind an ALB, raise its timeouts above the load balancer's:

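const http = require('node:http');

// `app` is your request handler, for example an Express application.
const server = http.createServer(app);
server.keepAliveTimeout = 65_000; // longer than the ALB 60s idle timeout
server.headersTimeout = 66_000;   // must exceed keepAliveTimeout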

For a client behind that same ALB, do the opposite — keep idle sockets for less than 60 seconds so you retire them before the load balancer can.
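One way to sketch the client side with the built-in http.Agent is shown below. The helper name createClientAgent and the 10-second margin are assumptions for illustration. It also relies on the agent destroying a pooled socket when its timeout fires while idle, which is my understanding of current Node behavior and worth verifying on your version. Note that the same timeout value also acts as an inactivity timeout on sockets with requests in flight.

```javascript
const http = require('node:http');

// Hypothetical helper: builds a keep-alive agent whose idle sockets are
// retired `marginMs` before the load balancer's idle timeout.
function createClientAgent(lbIdleTimeoutMs, marginMs = 10_000) {
  return new http.Agent({
    keepAlive: true,
    // Socket inactivity timeout in milliseconds. Assumption to verify:
    // the agent destroys a pooled socket when this fires while it is idle.
    timeout: lbIdleTimeoutMs - marginMs,
  });
}

// 50 s idle limit, safely under the ALB's 60 s default.
const agent = createClientAgent(60_000);

// Usage (host, path and callback are placeholders):
// http.get({ host: 'internal.example', path: '/health', agent }, onResponse);
```

Do not confuse this with the agent's keepAliveMsecs option, which sets the TCP keepalive probe delay rather than how long an idle socket stays in the pool.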

Retry idempotent requests once. A reset on an idle pooled socket almost always means the request never reached the application — it died on the wire before the server read it. For GET, HEAD, PUT, and DELETE, a single retry on a fresh connection is safe and clears the large majority of these errors. Be deliberate with POST: only retry when you can confirm the request never landed, or you risk a duplicate write.
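A minimal sketch of that policy follows. The names shouldRetry and sendWithOneRetry are hypothetical, and send stands for whatever function performs your request. The wrapper does not force a new connection itself. It assumes the client has already discarded the reset socket, so the second attempt lands on a different one.

```javascript
const IDEMPOTENT_METHODS = new Set(['GET', 'HEAD', 'PUT', 'DELETE']);

// True for a raw socket error or a fetch() error that wraps one in err.cause.
function isReset(err) {
  return (err?.code ?? err?.cause?.code) === 'ECONNRESET';
}

// Retry only when all three hold: first attempt, idempotent method, reset error.
function shouldRetry(method, err, attempt) {
  return (
    attempt === 0 &&
    IDEMPOTENT_METHODS.has(String(method).toUpperCase()) &&
    isReset(err)
  );
}

// `send` is a caller-supplied function that performs the request and
// returns a promise. It is called at most twice.
async function sendWithOneRetry(method, send) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await send();
    } catch (err) {
      if (!shouldRetry(method, err, attempt)) throw err;
    }
  }
}

// Usage (URL is a placeholder):
// const res = await sendWithOneRetry('GET', () => fetch('https://api.example/items'));
```

POST is deliberately left out of the set, so a reset on a POST surfaces to the caller, who can decide whether the write may have landed.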

Prefer undici's pool. Node's built-in fetch is backed by undici, whose connection pool tracks the server's Keep-Alive response header and recycles sockets more carefully than the legacy http.Agent. If you are still on a hand-rolled agent, moving to undici removes a class of stale-socket bugs.

To reproduce the race on demand, shrink the idle window. In staging, set the upstream's idle timeout to one or two seconds, send a request, wait three seconds, then send another through the same pool. The flaky failure becomes deterministic. To see the reset on the wire, run tcpdump -i any 'tcp[tcpflags] & tcp-rst != 0' — the source IP of the RST tells you whether your origin, a proxy, or the load balancer hung up.

Tracing one reset through an agent pool, a retry wrapper, and three layers of timeout config means reading across files that rarely sit next to each other. An editor that can follow that path quickly is worth having open.

Originally published at pickuma.com.