TL;DR
- We hit sporadic network errors in a high-throughput Lambda that made HTTP calls (Axios) and AWS SDK calls.
- Root cause: creating new HTTP clients/agents per invocation ballooned the number of open sockets (file descriptors).
- Fix: initialize clients and their https.Agent once at module scope with keep-alive and reuse them across warm invocations. For AWS SDK v2, also set AWS_NODEJS_CONNECTION_REUSE_ENABLED=1.
The scenario
We had a Lambda that was invoked asynchronously to process a large dataset (thousands of events). Inside the handler, we created an Axios client and AWS SDK client(s) for each invocation. Under sustained concurrency, we started seeing intermittent network failures.
Symptoms we saw
These popped up in CloudWatch logs while the Lambda was busy:
- “too many open files” errors:
Error: EMFILE: too many open files, open
NodeError: getaddrinfo ENFILE
- Connection instability:
AxiosError: socket hang up
Error: read ECONNRESET
Error: connect ECONNRESET
- Occasional timeouts and throttling-like behavior despite healthy downstream services
These were worse during bursts when many async invocations overlapped.
What’s really happening (FDs and sockets in Lambda)
- Every TCP connection (HTTP/HTTPS) consumes a file descriptor (FD).
- Lambda execution environments have a relatively low per-process FD limit (commonly around 1024).
- If you create a new HTTP client (and thus a new https.Agent) per invocation, each agent can open many sockets. Under high concurrency, you exhaust FDs, leading to the errors above (a quick way to measure this is sketched below).
- Lambda reuses the same execution environment for multiple “warm” invocations. Objects created at module scope are kept alive and reused, which is exactly what we want for clients and connection pools.
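If you want to see the pressure directly, the Lambda runtime is Linux, so procfs exposes both the currently open descriptors and the per-process limit. A small diagnostic sketch (the helper is mine, just for measurement, not part of the fix):
import fs from 'fs';
// Linux-only: every open FD of the current process appears in /proc/self/fd,
// and the per-process limit is listed in /proc/self/limits.
export const fdStats = () => ({
  openFds: fs.readdirSync('/proc/self/fd').length,
  limitLine: fs
    .readFileSync('/proc/self/limits', 'utf8')
    .split('\n')
    .find((line) => line.startsWith('Max open files')),
});
// console.log(fdStats()); // log at the start of the handler while reproducing the issue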
Why Node’s https.Agent matters
- The agent controls connection pooling and keep-alive.
- Creating a new agent per invocation increases the number of socket pools and the total sockets in use.
- Reusing a single agent keeps the number of open sockets bounded and allows connection reuse across requests, reducing FD pressure and latency.
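One way to confirm the pooling is actually working: Node's Agent exposes its pools, so you can log how many sockets are active, idle, or queued per origin. A minimal sketch (the helper name is mine):
import https from 'https';
const agent = new https.Agent({ keepAlive: true, maxSockets: 60 });
// agent.sockets, agent.freeSockets, and agent.requests are keyed by origin;
// each value is an array, so summing lengths gives a live count.
const agentStatus = () => ({
  active: Object.values(agent.sockets).reduce((n, list) => n + list.length, 0),
  idle: Object.values(agent.freeSockets).reduce((n, list) => n + list.length, 0),
  queued: Object.values(agent.requests).reduce((n, list) => n + list.length, 0),
});
// console.log(agentStatus()); // `active` stays bounded by maxSockets when the agent is shared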
The anti-pattern (what we had)
Creating new clients and agents inside the handler:
import axios from 'axios';
import https from 'https';

// Anti-pattern: runs on every invocation
export const handler = async () => {
  const ax = axios.create({
    httpsAgent: new https.Agent(), // new agent each time
  });
  const resp = await ax.get('https://api.example.com/data');
  return resp.data;
};
The AWS SDK has the same issue if you construct a new client per invocation, especially if you also create a dedicated agent for it.
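For illustration, the SDK flavor of the same anti-pattern looks roughly like this (a sketch using the v3 S3 client; the specifics are mine, not the original code):
import { S3Client, ListBucketsCommand } from '@aws-sdk/client-s3';
// Anti-pattern: a fresh client (and its own connection pool) on every invocation
export const handler = async () => {
  const s3 = new S3Client({ region: process.env.AWS_REGION });
  return s3.send(new ListBucketsCommand({}));
};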
The fix (module-level reuse with keep-alive)
Move client and agent creation to module scope so they’re created once per warm environment and then reused.
Axios
import axios from 'axios';
import https from 'https';

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 60, // tune based on expected concurrency per environment
  maxFreeSockets: 10,
  timeout: 30_000, // socket idle timeout
});

const ax = axios.create({
  headers: { 'Content-Type': 'application/json' },
  httpsAgent,
});

export const handler = async () => {
  const resp = await ax.get('https://api.example.com/data');
  return resp.data;
};
AWS SDK v3
import https from 'https';
import { NodeHttpHandler } from '@aws-sdk/node-http-handler'; // @smithy/node-http-handler in newer SDK versions
import { S3Client, ListBucketsCommand } from '@aws-sdk/client-s3';

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 60,
  maxFreeSockets: 10,
  timeout: 30_000,
});

const s3 = new S3Client({
  region: process.env.AWS_REGION,
  requestHandler: new NodeHttpHandler({
    httpsAgent,
    connectionTimeout: 3_000,
    socketTimeout: 30_000,
  }),
});

export const handler = async () => {
  const out = await s3.send(new ListBucketsCommand({}));
  return out;
};
AWS SDK v2
- Reuse clients, and enable connection reuse via env var.
import https from 'https';
import AWS from 'aws-sdk';

// Also set in Lambda env: AWS_NODEJS_CONNECTION_REUSE_ENABLED=1
AWS.config.update({
  region: process.env.AWS_REGION,
  httpOptions: { agent: new https.Agent({ keepAlive: true, maxSockets: 60 }) },
});

const s3 = new AWS.S3();

export const handler = async () => {
  const out = await s3.listBuckets().promise();
  return out;
};
Results after the change
- FD-related errors (EMFILE, ENFILE, socket hang ups) disappeared under the same workload.
- Lower p95 latency due to connection reuse.
- Fewer outbound connection spikes visible on NAT Gateway/ENI metrics (for VPC Lambdas).
- More predictable behavior during bursts.
Bonus mitigations
- Concurrency control: use SQS with a sane maxConcurrency/batchSize, reserved concurrency, or step-wise throttling to prevent bursts from scaling FD usage across many environments at once.
- Timeouts and retries: set realistic timeouts; add backoff with jitter to avoid synchronized retries (a minimal sketch follows this list).
- context.callbackWaitsForEmptyEventLoop = false: can help the handler return even if the agent keeps idle sockets open (don’t overuse it).
- Consider undici for HTTP in Node 18+; it provides efficient HTTP/1.1 keep-alive by default.
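For the retry bullet above, here is a minimal full-jitter sketch around the shared ax client from the Axios example (helper names and numbers are illustrative, not from the original code):
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Full jitter: wait a random delay in [0, base * 2^attempt] so overlapping
// invocations don't retry in lockstep against the same downstream service.
async function getWithRetry(url, attempts = 3, baseDelayMs = 200) {
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    try {
      return await ax.get(url, { timeout: 5_000 });
    } catch (err) {
      if (attempt === attempts - 1) throw err;
      await sleep(Math.random() * baseDelayMs * 2 ** attempt);
    }
  }
}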
Quick checklist
- Initialize HTTP clients and SDK clients at module scope.
- Use a shared https.Agent with keepAlive: true; set maxSockets, maxFreeSockets, and timeouts.
- For AWS SDK v2, set AWS_NODEJS_CONNECTION_REUSE_ENABLED=1.
- Avoid creating clients/agents inside loops or inside the handler.
- Monitor and tune under realistic concurrency.
Closing thoughts
FD exhaustion is easy to miss until traffic scales. In serverless, the simplest lever is to reuse resources across warm invocations. One shared agent + one shared client per execution environment eliminates a whole class of flaky, intermittent network issues.