Marcos Henrique for AWS Community Builders

Posted on Nov 19, 2024

Haunted by EMFILE Issue and some ways to exorcize it

#aws

It was a pitch-black, stormy night in prod: 3 AM, right on the dot. The dashboard glowed an unnatural green, like everything was chill. Too chill. Any engineer knows that’s when the real action's about to hit.

I was wrapping up my last microservice commit when PagerDuty decided to go full panic mode. The Lambdas? Yeah, they were waking up like zombies, spawning instances all over like some digital outbreak. Our sleek, serverless setup had gone full horror show, a haunted house of EMFILE errors popping off everywhere.

Every engineer’s got that one nightmare incident burned into their brain. This was mine: the story of how our “scalable,” production-ready Lambda farm went rogue, turning into a resource-hungry beast that nearly swallowed our AWS account whole. It's time to dig into the basics and fix this mess...

The Architecture (aka The Problem Factory)

Look, the setup was pretty straightforward (or so I thought):

Lambda function triggered by SQS (basic stuff)
Step Functions orchestration (because parallel processing is the future)
Two more Lambdas: one in a VPC for internet access, another in a private VPC
S3 for state management (turns out, this was way more interesting than expected)

But here's the thing - and this is critical - when you're dealing with distributed systems at scale, the problems you encounter aren't linear. They're exponential.

The EMFILE Apocalypse

So there I was, thinking everything was working beautifully (narrator: it wasn't). Then, EMFILE errors started cascading through the system. It was like watching your entire production infrastructure decide to take an unscheduled vacation. Not great.

My first instinct? Throw a delay at it. Classic engineering response, right? It's like putting a time delay on your rocket launch because the launchpad is too hot. Sure, it works, but it's not exactly pushing the boundaries of innovation.

The Hyperplane Hypothesis

Now, this is where it gets interesting. AWS Hyperplane - many devs think it's some magical connection-sharing unicorn. It's not. Thinking this way is about as useful as expecting your car to transform into a submarine just because it's waterproof.

The Community Breakthrough

Two absolute legends from the serverless community dropped some knowledge bombs:

Omid's Take: "Hey, Hyperplane just manages IP allocation. Your connection problems? That's all you."
- Brutal, but accurate.
- Suggested tweaking SDK clients with keepAlive (basic physics of network optimisation)
Walmsles' Insight: "You're hitting Node.js file descriptor limits - 1024 per environment."
- This is the kind of constraint that separates the pros from the amateurs
- Every open socket counts as a file descriptor, so unlimited concurrency + lots of S3 operations = disaster waiting to happen
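One way to act on that insight inside a single invocation is to cap how many operations are in flight at once, so the number of open sockets stays well below the descriptor limit. The sketch below is an illustration, not code from the incident: `mapWithLimit` is a hypothetical helper name, and the limit you pick is an assumption you would tune for your own workload.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results come back in the same order as `items`.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims the next unclaimed item until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Start at most `limit` runners; the first rejection rejects the whole call.
  const runners = Array.from(
    { length: Math.min(limit, items.length) },
    () => runner(),
  );
  await Promise.all(runners);
  return results;
}
```

In a handler this might wrap the S3 calls, for example `await mapWithLimit(keys, 20, (key) => fetchObject(key))`, where `keys` and `fetchObject` are placeholders for your own list and S3 call.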

The Solution (Because Success is the Only Option)

Time to approach this like an engineering problem:

Concurrency Control

Here's how to implement this properly in CDK:

// The right way to handle concurrency in CDK
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import { Construct } from 'constructs';

// The VPC and state machine are created elsewhere and passed in
export interface LambdaStackProps extends cdk.StackProps {
  vpc: ec2.IVpc;
  stepFunction: sfn.IStateMachine;
}

export class LambdaStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: LambdaStackProps) {
    super(scope, id, props);

    // Create your Lambda function
    const fanoutFunction = new lambda.Function(this, 'FanoutFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      // Set reserved concurrent executions
      reservedConcurrentExecutions: 100,
      // Other configurations...
      vpc: props.vpc,
      environment: {
        STEP_FUNCTION_ARN: props.stepFunction.stateMachineArn,
      },
    });

    // If the function is already defined elsewhere in this CDK app, the same
    // setting can be applied through its CloudFormation resource (shown here
    // on fanoutFunction for illustration). This does not work for functions
    // imported with lambda.Function.fromFunctionArn(): they have no
    // CfnFunction in this app, so change them where they are defined.
    const cfnFunction = fanoutFunction.node.defaultChild as lambda.CfnFunction;
    cfnFunction.reservedConcurrentExecutions = 100;
  }
}

Client Configuration Optimization

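   // Outside the handler - this is just basic efficiency
   import { S3Client } from '@aws-sdk/client-s3';
   import { NodeHttpHandler } from '@smithy/node-http-handler';
   import { Agent } from 'node:https';
   const s3Client = new S3Client({
     maxAttempts: 3,
     // In AWS SDK for JavaScript v3, timeouts and keep-alive live on the request handler
     requestHandler: new NodeHttpHandler({
       requestTimeout: 3000,
       httpsAgent: new Agent({ keepAlive: true }),
     }),
   });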

The Hard Truth About Serverless Scale

Here's what nobody tells you in the AWS documentation (but they should):

Unlimited concurrency is like unlimited acceleration - sounds great until you hit the speed of light
File descriptors are the new bottleneck - forget about CPU and memory
Community knowledge is worth more than any documentation
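If file descriptors are the bottleneck, it helps to be able to see how many are open. The sketch below is one possible diagnostic, assuming a Linux environment where `/proc/self/fd` is readable (Lambda runs on Linux); the function names are hypothetical, and the 1024 budget is simply the figure quoted earlier in this post.

```typescript
import { readdirSync } from 'node:fs';

// Assumed budget: the 1024-per-environment figure quoted earlier.
const FD_LIMIT = 1024;

// Count the file descriptors currently open in this process.
// Returns null where /proc is unavailable (for example macOS or Windows).
// The count includes the descriptor briefly opened to read the directory.
function countOpenFileDescriptors(): number | null {
  try {
    return readdirSync('/proc/self/fd').length;
  } catch {
    return null;
  }
}

// How many descriptors are left before the assumed limit is reached.
function fdHeadroom(open: number | null, limit: number = FD_LIMIT): number | null {
  return open === null ? null : Math.max(0, limit - open);
}

// Example: log the current usage, e.g. at the start of a handler.
const openFds = countOpenFileDescriptors();
console.log(`open fds: ${openFds ?? 'unknown'}, headroom: ${fdHeadroom(openFds) ?? 'unknown'}`);
```

Logging this at the start and end of an invocation is a cheap way to spot descriptors that keep climbing across warm invocations, which usually points at sockets or files that are never closed.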

Final Thoughts

Look, at the end of the day, this isn't just about fixing EMFILE errors. It's about understanding system constraints and working with them, not against them. Like gravity - you can fight it, or you can build a rocket that works with it.

The future of serverless is incredible, but only if we approach it with both ambition AND understanding. Otherwise, we're just building fancy ways to crash our systems.

P.S. Special thanks to the serverless community

If you're interested in this amazing community, please join us at https://www.believeinserverless.com/
