
in case of fire, put your heap in a cage - how we reduced our memory usage

A few weeks ago, I dropped a message in our team Slack that made everyone stop what they were doing:

"The work we have made on axios interceptors refactor + removing in memory cache + v8 tuning seems to improved mem usage A LOT"

Some graphs came with it. One of those graphs where you don't need to read the y-axis to understand what happened πŸ™‡β€β™‚οΈ, the line simply drops and stays down. AWS confirmed it at the same timestamp.

[Graphs: live heap decrease, heap decrease, and overall memory decrease]

This is the story of how we got there...

Spoiler: Datadog with its heap snapshot is the chef's kiss

Preface (⚠️ DD&ECS&CDK newbies only)

Just in case you've never configured Datadog tracing or created an ECS cluster on Fargate (with CDK, obviously), the next few snippets walk through it. If you've already done this, jump to the next section...

1. Creating the ECS Cluster

import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { tags: { Name: 'default-vpc' } });

const cluster = new ecs.Cluster(this, 'MainCluster', {
  clusterName: 'main-cluster',
  vpc,
  // Fargate doesn't need capacity providers or EC2 instances
  containerInsights: true, // enables CloudWatch Container Insights
});

1.1 Service Connect (just in case you don't want to pay extra money on egress/ingress)

import * as cloudmap from 'aws-cdk-lib/aws-servicediscovery';

const cluster = new ecs.Cluster(this, 'MainCluster', {
  clusterName: 'main-cluster',
  vpc,
  containerInsights: true,
  defaultCloudMapNamespace: {
    name: 'playground.local',                 // internal DNS namespace
    type: cloudmap.NamespaceType.DNS_PRIVATE, // private to the VPC
    vpc,
  },
});

2. Task definition + Datadog agent sidecar

  const taskDef = new ecs.TaskDefinition(this, 'TaskDef', {
    compatibility: ecs.Compatibility.FARGATE,
    cpu: '1024',
    memoryMiB: '2048',
    networkMode: ecs.NetworkMode.AWS_VPC,
  });

  // Datadog agent sidecar
  taskDef.addContainer('datadog-agent', {
    image: ecs.ContainerImage.fromRegistry('public.ecr.aws/datadog/agent:latest'),
    essential: true,
    cpu: 256,
    memoryReservationMiB: 512,
    environment: {
      DD_API_KEY: datadogApiKey, // prefer ECS `secrets` (Secrets Manager) over a plain env var in production
      DD_SITE: 'datadoghq.com',
      ECS_FARGATE: 'true',
      DD_APM_ENABLED: 'true',
      DD_APM_NON_LOCAL_TRAFFIC: 'true',
      DD_PROFILING_ENABLED: 'true',      // enables Memory Leaks tab
      DD_RUNTIME_METRICS_ENABLED: 'true',
    },
    portMappings: [
      { containerPort: 8125, protocol: ecs.Protocol.UDP },
      { containerPort: 8126, protocol: ecs.Protocol.TCP },
    ],
  });

3. App container wired to the agent

  taskDef.addContainer('my-api', {
    image: ecs.ContainerImage.fromRegistry('my-ecr-repo/api:latest'),
    essential: true,
    cpu: 768,
    memoryReservationMiB: 1536,
    environment: {
      DD_SERVICE: 'my-api',
      DD_ENV: 'production',
      DD_VERSION: '1.0',
      DD_TRACE_AGENT_URL: 'http://localhost:8126', // points to sidecar
      DD_PROFILING_ENABLED: 'true',
      DD_RUNTIME_METRICS_ENABLED: 'true',
    },
    dockerLabels: {
      'com.datadoghq.tags.service': 'my-api',
      'com.datadoghq.tags.env': 'production',
      'com.datadoghq.tags.version': '1.0',
    },
  });

4. Node.js app β€” init tracer before anything else

  // tracer.ts β€” must be the first import in main.ts
  import tracer from 'dd-trace';

  tracer.init({
    profiling: true,       // heap snapshots β†’ Memory Leaks tab
    runtimeMetrics: true,  // Node.js live heap graph
    logInjection: true,    // correlates logs with traces
  });

  // main.ts
  import './tracer';
  import { NestFactory } from '@nestjs/core';
  // ...

  // usage: wrapping a unit of work in a manual span
  import tracer from 'dd-trace';

  async function fooFunc(actionId: string) {
    const span = tracer.startSpan('action.subAction', {
      tags: { 'action.id': actionId },
    });

    try {
      await foo.bar(actionId);
    } catch (err) {
      span.setTag('error', err);
      throw err;
    } finally {
      span.finish(); // always close the span, success or failure
    }
  }

How it looks in Datadog

The Dashboard Doesn't Lie

Like most memory issues, ours wasn't dramatic. There was no crash, no alert, no customer ticket... Just a Datadog APM dashboard showing memory usage slowly creeping upward between deployments πŸ‘».
The kind of thing that's easy to rationalise:

maybe it's traffic growth, maybe it's a new feature, maybe it's just Node being Node (since our traffic was growing)

Except it kept going. And at some point, we started asking:

What if it's us?

Spoiler: it was us

The Axios Interceptor Trap

We use axios-retry across a lot of our outbound integrations, so the pattern across all of them looked like this:

async createMeeting(data: MeetingData) {
  axiosRetry(this.http.axiosRef, {
    retries: 3,
    retryDelay: axiosRetry.exponentialDelay,
    retryCondition: axiosRetry.isRetryableError,
  });

  // ... the actual HTTP request
}

Looks fine, right? You're just configuring retry behaviour before making a request. Completely reasonable.

Except axiosRetry doesn't just configure behaviour. It registers interceptors.

Every call to axiosRetry(instance, options) calls instance.interceptors.request.use(...) and instance.interceptors.response.use(...) under the hood. Those arrays in Axios grow with every call. They never get cleaned up. Every function reference that gets pushed there lives on the heap for the lifetime of the process.

Therefore, what was actually happening: every incoming request to our API triggered outbound calls to third-party services. Every one of those outbound calls registered a new pair of retry interceptors. After thousands of requests, I had thousands of interceptors sitting quietly in memory... all alive πŸ§Ÿβ€β™‚οΈ..., all holding references, all doing redundant work.
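To make the accumulation concrete, here's a minimal sketch of the mechanism. This is not the real Axios internals, just a simplified stand-in for its interceptor manager, and `fakeAxiosRetry` is an illustrative name for what axios-retry effectively does on each call:

```typescript
// Simplified stand-in for Axios's interceptor manager: each `use()`
// pushes a handler that lives for the lifetime of the instance.
class InterceptorManager {
  handlers: Array<(value: unknown) => unknown> = [];
  use(fn: (value: unknown) => unknown): void {
    this.handlers.push(fn);
  }
}

const fakeAxiosInstance = {
  interceptors: {
    request: new InterceptorManager(),
    response: new InterceptorManager(),
  },
};

// What axios-retry effectively does on every call: register a new pair.
function fakeAxiosRetry(instance: typeof fakeAxiosInstance): void {
  instance.interceptors.request.use((config) => config);
  instance.interceptors.response.use((response) => response);
}

// Simulate the buggy pattern: registration inside a per-request method.
for (let i = 0; i < 1000; i++) {
  fakeAxiosRetry(fakeAxiosInstance); // one new pair per "request"
}

console.log(fakeAxiosInstance.interceptors.request.handlers.length);  // 1000
console.log(fakeAxiosInstance.interceptors.response.handlers.length); // 1000
```

A thousand simulated requests, a thousand handler pairs, none of them ever released. That's the shape of the leak.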

The fix is almost embarrassingly 😳 simple: move the registration to the constructor, so it happens exactly once per service instance.

constructor(private http: HttpService) {
  axiosRetry(this.http.axiosRef, {
    retries: 3,
    retryDelay: axiosRetry.exponentialDelay,
    retryCondition: axiosRetry.isRetryableError,
  });
}

We applied this fix across every service that was doing it wrong (and yeah, that was a lot of places). Every single one of them was leaking 🤯
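If you can't guarantee a single construction site, or several services share one axios instance, a defensive variant is to guard registration so it runs at most once per instance. A hedged sketch: `registerRetryOnce` is an illustrative helper, not part of axios-retry.

```typescript
// Remember which instances have already been configured. A WeakSet
// holds them weakly, so it doesn't itself keep instances alive.
const configured = new WeakSet<object>();

// `registerRetryOnce` is an illustrative name, not an axios-retry API.
function registerRetryOnce<T extends object>(
  instance: T,
  setup: (instance: T) => void,
): boolean {
  if (configured.has(instance)) return false; // already done, skip
  setup(instance);
  configured.add(instance);
  return true;
}

// Demo with a stand-in instance and a counting setup callback:
let setupCalls = 0;
const instance = {};

registerRetryOnce(instance, () => setupCalls++);
registerRetryOnce(instance, () => setupCalls++); // no-op the second time

console.log(setupCalls); // 1
```

In real code, `setup` would be the `axiosRetry(instance, options)` call. The constructor fix above is still the cleaner answer; this guard is belt-and-braces for codebases where you can't audit every call site.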

This was almost certainly the biggest contributor to what we saw on the graphs.

What Lives in Your Heap Shouldn't Have to πŸ§Ÿβ€β™‚οΈ

While investigating the interceptor issue, we also looked at our MemoryCacheService. It was using a plain JavaScript Map to cache lab scheduling slot data, a totally normal pattern for reducing external API calls.

The issue isn't that caching is bad. It's that in-process heap memory is a shared, finite resource. Lab scheduling data cached indefinitely in a Map competes with the V8 heap for space. As the cache grows, so does GC pressure. As GC pressure grows, pause times get longer, and memory spikes become more visible.

We replaced the in-memory Map with a dedicated Redis connection (you should probably use Valkey, but we were on boomer Redis 🥱), pointed at DB 5 to keep it isolated from our primary data: SCAN-based pattern deletes, PX TTLs, JSON serialise/deserialise. The interface stayed exactly the same for all consumers; they never knew the difference. But the data no longer lives in the Node.js heap.

This is a meaningful architectural shift: anything that doesn't need to be in-process shouldn't be. Redis is fast enough for this use case, so cache growth no longer means heap growth.
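The shape of the swap looks roughly like this. The post doesn't show the real MemoryCacheService, so all names here are assumptions, and to keep the sketch self-contained it's written against a minimal client interface with an in-memory stub; in production you'd pass a real ioredis/node-redis client instead.

```typescript
// Minimal client surface the cache needs. In production this would be
// an ioredis or node-redis client; the stub below just makes the
// sketch runnable anywhere.
interface RedisLike {
  set(key: string, value: string, px: number): Promise<void>;
  get(key: string): Promise<string | null>;
}

class RedisCacheService {
  constructor(private readonly client: RedisLike) {}

  // JSON-serialise on write, PX TTL in milliseconds.
  async set<T>(key: string, value: T, ttlMs: number): Promise<void> {
    await this.client.set(key, JSON.stringify(value), ttlMs);
  }

  // JSON-deserialise on read; null on miss or expiry.
  async get<T>(key: string): Promise<T | null> {
    const raw = await this.client.get(key);
    return raw === null ? null : (JSON.parse(raw) as T);
  }
}

// In-memory stub standing in for Redis.
class FakeRedis implements RedisLike {
  private store = new Map<string, { value: string; expiresAt: number }>();
  async set(key: string, value: string, px: number): Promise<void> {
    this.store.set(key, { value, expiresAt: Date.now() + px });
  }
  async get(key: string): Promise<string | null> {
    const entry = this.store.get(key);
    if (!entry || Date.now() > entry.expiresAt) return null;
    return entry.value;
  }
}

async function demo(): Promise<string> {
  const cache = new RedisCacheService(new FakeRedis());
  await cache.set('slots:lab-1', { open: ['09:00', '09:30'] }, 60_000);
  const hit = await cache.get<{ open: string[] }>('slots:lab-1');
  return hit ? hit.open.join(',') : 'miss';
}

demo().then(console.log); // 09:00,09:30
```

The point of coding to the small `RedisLike` interface is exactly what made our migration painless: consumers depend on `get`/`set` semantics, not on where the bytes live.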

V8 Knows More Than I Was Telling It

The third piece was more subtle. I added the --optimize-for-size flag to both our API and worker startup commands:

"start:prod": "node --optimize-for-size -r dd-trace/init dist/src/main"

This flag tells V8 to bias toward a smaller memory footprint over raw JIT throughput. In a containerized, multi-instance deployment β€” where memory is often the binding constraint and CPU headroom is available β€” that's usually the right trade-off.

The inspiration came from reading Platformatic's deep dive, "We cut Node.js' Memory in half".
They explored V8's pointer compression feature β€” a mechanism that reduces each heap pointer from 64 bits to 32 bits, effectively halving how much space every JavaScript object consumes in memory. Chrome has used it since 2020. Node.js is only now catching up, and Platformatic did the work to make it accessible.

--optimize-for-size is a lighter version of the same philosophy: tell V8 to prefer compact representations over aggressive JIT optimisations. The JIT compiles more compact bytecode, uses smaller internal structures, and is generally more aggressive about returning memory to the operating system. For a server handling many concurrent short-lived requests rather than long-running CPU-heavy computations, the trade-off consistently lands in your favour.

What the Graphs Showed

The drop was visible and immediate. Not a gradual downward trend β€” an actual drop at the deployment timestamp, confirmed independently by AWS metrics. The kind of result that makes the whole investigation feel worth it.

One of the engineers on the team put it well in the thread that followed:

"You guys should let the wider dev team know β€” how you found it and also the reasoning behind. The rest of the team can also pick up the source and the worker, too… I imagine we'll get benefits there too."

That's exactly why I'm writing this.

But Wait!!!! There's a Better Standard Now: node-caged

Everything I described above works today, without any tooling changes. But I'd be doing you a disservice if I didn't point to where this is all heading.

The Platformatic team published a Docker image called node-caged that enables V8 pointer compression in Node.js with a one-line Dockerfile swap:

# Before
FROM node:25-bookworm-slim

# After
FROM platformatic/node-caged:25-slim

No code changes. No configuration. Same Node.js APIs, same behaviour, just with every heap pointer compressed from 8 bytes to 4 bytes.

Why does that matter? Because every JavaScript object on the V8 heap contains multiple internal pointers: to its hidden class (shape), to where its properties are stored, to string values, to prototype chains. All of those get cut in half. For real-world applications with mixed workloads β€” I/O, JSON parsing, middleware chains, database queries β€” Platformatic's benchmarks showed a 50% reduction in heap usage with only 2.5% average latency overhead. And counterintuitively, p99 and max latency actually improved by 7-8%, because a smaller heap means the garbage collector has less to scan and fewer long pauses.

The reason this hasn't been the default in Node.js until now is a technical constraint called the "memory cage" β€” pointer compression originally forced the entire process (main thread and all workers) to share a single 4GB address space. That's fine for Chrome, where each tab runs in its own process. Not fine for Node.js, where workers share a process. Cloudflare sponsored Igalia to introduce IsolateGroups into V8, which gives each isolate its own 4GB cage. That work landed in Node.js 25, and node-caged ships it as a ready-to-use Docker image.
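Before trusting any base-image swap, it helps to measure the same workload in both images. Here's a rough probe I'd run under both `node:25-bookworm-slim` and `platformatic/node-caged:25-slim` and compare; the allocation shape and counts are arbitrary, it's a sketch, not a benchmark:

```typescript
// Rough heap probe: allocate a pointer-heavy structure and report how
// much the V8 heap grew. Run the same script in both base images and
// compare; pointer compression should shrink the delta.
function heapUsedMiB(): number {
  return process.memoryUsage().heapUsed / (1024 * 1024);
}

// More stable numbers if run with --expose-gc, otherwise skip the GC.
const maybeGc = (globalThis as { gc?: () => void }).gc;
if (maybeGc) maybeGc();
const before = heapUsedMiB();

// Objects full of references are exactly where pointer compression pays off.
const objects: Array<{ id: number; tags: string[]; child: object }> = [];
for (let i = 0; i < 200_000; i++) {
  objects.push({ id: i, tags: ['a', 'b'], child: { parent: null } });
}

const after = heapUsedMiB();
console.log(
  `heap delta: ${(after - before).toFixed(1)} MiB for ${objects.length} objects`,
);
```

If the delta roughly halves under node-caged while your real p99 holds, that's the same signal Platformatic's benchmarks describe, just on your own workload.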

Should you try it? Yes, with one check first. Native addons built on the legacy NAN (Native Abstractions for Node.js) are incompatible, because pointer compression changes V8's internal object layout. Run:

npm ls nan

If nothing shows up, you're in the clear. Most popular native packages have already migrated to Node-API (sharp, bcrypt, canvas, sqlite3, bufferutil) β€” the only notable holdout is nodegit.

If your dependency tree is clean, the path forward is:

  1. Swap the Docker base image in staging
  2. Watch your memory usage drop
  3. Confirm your p99 stays within SLO (it probably improves)
  4. Ship it

I haven't validated this in our own staging environment yet, but it's next on the list. The gains I got from fixing interceptors, moving to Redis, and adding --optimize-for-size were real and visible. Pointer compression would multiply them.

The original post from Matteo Collina at Platformatic explains the mechanism in detail: We cut Node.js' Memory in half. Worth a full read. The repo is at github.com/platformatic/node-caged.

Debugging Is an Art πŸ‘©β€πŸŽ¨πŸŽ¨ β€” and That's the Real Lesson

There's a post I keep coming back to: "The Real Skill in Programming Is Debugging β€” Everything Else Is Copy-Paste". The argument is that as you grow as an engineer, you spend less time writing new code and more time doing something harder β€” understanding why a system isn't behaving the way you think it should. Writing code can be learned from docs and examples. Architecture is, in the author's words, "just a high-level copy-paste." But debugging? That's pure problem solving. It's the part that can't easily be replaced, because it requires holding a mental model of a live system and interrogating it.

This investigation felt exactly like that. Nothing in the codebase was wrong in isolation. Every single service calling axiosRetry inside a method was doing something locally reasonable β€” setting up retry behavior before a request. The problem only existed at the intersection of how Axios works internally, how many times those methods get called over time, and how V8 manages heap references. You can't find that by reading one file. You find it by asking: what does this code actually do to the system over thousands of calls?

That's the art of it. And it's worth naming explicitly, because the fix itself β€” moving a function call to a constructor β€” takes about ten seconds. The hours went into understanding why it mattered.

With AI now generating code faster than ever, I think this dynamic is only going to sharpen. The ability to produce syntax is increasingly cheap. The ability to understand what's broken and why β€” to trace through logs, follow data flows, form and test hypotheses β€” that's the scarce skill.
Senior engineers aren't people who write more code. They're the ones who can look at a slowly climbing memory graph and not shrug it off.

The Lessons That Actually Stuck

Know what your libraries do to shared state. axiosRetry doesn't configure your axios instance β€” it modifies it by pushing to an array. Understanding that difference would have caught this pattern immediately. Any library that hooks into request/response pipelines deserves scrutiny around when and how often it registers those hooks.

In-process cache is heap, full stop. Every object in a Map is an object the GC has to account for. For ephemeral, TTL-bounded data, Redis is usually a better home β€” especially when you're already running it.

V8 has production-facing flags and most teams ignore them. --optimize-for-size is real. V8 pointer compression via node-caged is real. These are not academic curiosities β€” they're levers that move production graphs.

Memory leaks don't have to be dramatic to be real. Mine wasn't a smoking gun. It was a pattern repeated across a dozen services, each one individually reasonable, all of them together creating a slow bleed. The investigation just required asking: what does this function actually do?

If you've hit similar issues, especially the Axios interceptor accumulation pattern (which I suspect is far more common than people realize), I'd love to compare notes.

The work described in this post landed in a PR contributed by AndrΓ© Paris and me.

🍻
