From 15-Minute Lambda Timeouts to Sub-Second Runs — A DynamoDB Optimization Story

The morning the dashboards went quiet

It started, as these things usually do, with a Teams ping.

"Hey, the subscription stats haven't refreshed in two days."

Two scheduled Lambdas in our observability project — calculateSubscriptionStats and calculatePartnerNotificationStats — were responsible for crunching the daily numbers our internal teams relied on. For a while they had been slow but working. Now they were just… not finishing.

A quick look at CloudWatch confirmed the bad news. Both jobs were hammering against the 15-minute Lambda ceiling (900,000 ms), getting killed mid-flight, and occasionally tripping OutOfMemory errors before the timeout could even register.

Here's what the duration log looked like before I touched anything — every single run, day after day, glued to the 900,000 ms ceiling:

calculateSubscriptionStats — multiple invocations per day, all timing out:

CloudWatch durations before optimization — both Lambdas timing out at 900,000ms

calculatePartnerNotificationStats — multiple invocations per day, all timing out:

CloudWatch durations before optimization — both Lambdas timing out at 900,000ms

Every row in both panels is 900,000.00 — Lambda's hard 15-minute ceiling. The jobs weren't slow, they were dying. And because they died mid-flight, the downstream stats were either stale or partial.

The pattern was clear: the more data we accumulated, the closer to the cliff we got. And we'd just gone over it.

Why it was actually breaking

I expected to find one bad query. Instead I found three compounding problems, each making the others worse.

Problem 1 — We were downloading everything, every time

Our DynamoDB access layer was written in the just-give-me-the-row style. Every call returned the full item schema. For our high-volume tables (Addon, Bundle, Partner Notifications), each item carried fat fields nobody downstream actually used: oldData, newData, and deeply nested metadata blobs from audit history.

DynamoDB paginates results at a hard 1 MB per page. When 70% of each item's bytes are fields we'll throw away, the math gets ugly fast:

  • More pages → more network roundtrips
  • More JSON to parse → more CPU
  • More objects in heap → Node.js GC working overtime
  • Eventually → JavaScript heap out of memory

Problem 2 — Recomputing history every single day

calculatePartnerNotificationStats was doing something even worse: a full table sweep on every invocation, just to compute running distributions like uniqueCustomers and monthlyCounts.

This works fine when the table has 10,000 rows. It is a death sentence when the table has millions and grows daily. Yesterday's run had to re-read yesterday's data and every day before it — forever.
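
To make the shape of the problem concrete, the old read path boiled down to something like this — a simplified sketch rather than the actual code, reusing the table name from the examples below: a paginated Scan that walks the whole table and keeps every full-sized item in memory before any aggregation starts.

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

// Roughly the old pattern: sweep the entire table on every run and
// hold every fat item in the heap before aggregating anything.
async function fetchAllNotifications() {
  const items = [];
  let ExclusiveStartKey;
  do {
    const page = await ddb.scan({
      TableName: 'PartnerNotifications',
      ExclusiveStartKey,            // undefined on the first page
    }).promise();
    items.push(...page.Items);      // fat items pile up in memory
    ExclusiveStartKey = page.LastEvaluatedKey;
  } while (ExclusiveStartKey);      // more 1 MB pages as the table grows
  return items;                     // O(table size) — forever
}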

Problem 3 — The accumulators lived in Lambda memory

Because the running totals were rebuilt from scratch each run, there was no concept of "previous state." The Lambda was simultaneously the worker and the source of truth. If it died, the numbers were just… wrong until the next successful run.

The fix, in three layers

I'll walk through these in the order I implemented them, because each one unlocked the next.

Layer 1 — ProjectionExpression everywhere

The cheapest win, and the one I should have done a year ago. Instead of fetching whole items, ask DynamoDB for exactly the fields you'll use.

Before:

const params = {
  TableName: 'PartnerNotifications',
  KeyConditionExpression: 'partner_id = :pid',
  ExpressionAttributeValues: { ':pid': partnerId },
};
const result = await ddb.query(params).promise();
// result.Items each ~8KB, mostly oldData/newData/metadata we don't use

After:

const params = {
  TableName: 'PartnerNotifications',
  KeyConditionExpression: 'partner_id = :pid',
  ProjectionExpression: 'smc, sk, created_at, #status, #type, #event',
  ExpressionAttributeNames: {
    '#status': 'status',
    '#type':   'type',
    '#event':  'event',
  },
  ExpressionAttributeValues: { ':pid': partnerId },
};
const result = await ddb.query(params).promise();
// result.Items each ~600 bytes — same rows, fraction of the bytes

A gotcha I hit immediately: status, type, event, action, plan, and bundle are all DynamoDB reserved words. The query throws ValidationException the moment you use them in a ProjectionExpression without aliasing. ExpressionAttributeNames with #-prefixed placeholders fixes it cleanly.

The payoff: with each item going from ~8 KB to ~600 bytes, every response is a fraction of the size to download, parse, and hold in the heap. We estimated network payload and memory dropped by 80–90% across these queries. Same data, way less weight.

Layer 2 — Incremental delta fetching with S3 as the state store

This is where the real timeout fix came from. Even with skinnier items, scanning the entire history on every run was still O(n) over a dataset that grows every day. The trick is to stop doing that.

The new flow:

  1. Read previous state from S3 at the top of the Lambda — stats.json and event-stats.json. Each carries a top-level lastExecutionTime.
  2. Query DynamoDB for the delta only — pass lastExecutionTime into the KeyConditionExpression (when the sort key is time-based) or FilterExpression (when it isn't).
  3. Write the new state back to S3 at the end, with the updated lastExecutionTime set to the start of this run.

Sketch:

// 1. Hydrate previous state from S3
const prevStats = await s3GetJson('observability/stats.json');
const lastExecutionTime = prevStats.lastExecutionTime;
const nowIso = new Date().toISOString();

// 2. Fetch only what's new
const params = {
  TableName: 'PartnerNotifications',
  KeyConditionExpression: 'partner_id = :pid AND created_at > :since',
  ProjectionExpression: 'smc, sk, created_at, #status, #event',
  ExpressionAttributeNames: { '#status': 'status', '#event': 'event' },
  ExpressionAttributeValues: {
    ':pid':   partnerId,
    ':since': lastExecutionTime,
  },
};
const delta = await queryAllPages(ddb, params);

// 3. Hand the delta + previous state to the processor (next layer)
const newStats = processPartnerEventLogs(delta, prevStats);

// 4. Persist new state
await s3PutJson('observability/stats.json', {
  ...newStats,
  lastExecutionTime: nowIso,
});

S3 here is the perfect fit: cheap, durable, atomic on a single PUT, and the file is small enough to read and write in milliseconds. This is the move that took calculatePartnerNotificationStats from "times out at 15 minutes" to "finishes in under a second."
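
The helpers referenced above — queryAllPages, s3GetJson, s3PutJson — aren't shown in the snippets. Thin versions of them could look roughly like this (the bucket name is illustrative, and a real version would add error handling plus a fallback for the very first run when stats.json doesn't exist yet):

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();
const s3 = new AWS.S3();
const BUCKET = 'observability-stats';   // illustrative bucket name

// Drain every page of a Query by following LastEvaluatedKey
async function queryAllPages(client, params) {
  const items = [];
  let ExclusiveStartKey;
  do {
    const page = await client.query({ ...params, ExclusiveStartKey }).promise();
    items.push(...page.Items);
    ExclusiveStartKey = page.LastEvaluatedKey;
  } while (ExclusiveStartKey);
  return items;
}

// Small JSON snapshots in and out of S3
async function s3GetJson(key) {
  const obj = await s3.getObject({ Bucket: BUCKET, Key: key }).promise();
  return JSON.parse(obj.Body.toString('utf-8'));
}

async function s3PutJson(key, data) {
  await s3.putObject({
    Bucket: BUCKET,
    Key: key,
    Body: JSON.stringify(data),
    ContentType: 'application/json',
  }).promise();
}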

Layer 3 — In-memory merging instead of recomputation

Incremental fetching breaks if your processors still think they're seeing the whole dataset. They'd overwrite the running totals with just today's slice.

So I refactored the analytic processors — processPrimePartnerEventLogs, processMaxPartnerEventLogs, processScheduledActionsCalculation — to hydrate themselves from the previous snapshot and merge the delta in:

function processPartnerEventLogs(deltaItems, prevEventAnalysis) {
  // Start from previous state, not from zero
  const analysis = structuredClone(prevEventAnalysis ?? emptyAnalysis());

  for (const item of deltaItems) {
    // Increment counters
    analysis.totalEvents += 1;
    analysis.eventsByType[item.event] = (analysis.eventsByType[item.event] ?? 0) + 1;

    // Maintain a set-like structure for uniqueness across runs
    if (!analysis.seenCustomers[item.smc]) {
      analysis.seenCustomers[item.smc] = true;
      analysis.uniqueCustomers += 1;
    }

    // Bucket into monthly distribution
    const month = item.created_at.slice(0, 7); // YYYY-MM
    analysis.monthlyCounts[month] = (analysis.monthlyCounts[month] ?? 0) + 1;
  }

  return analysis;
}

The accumulators now live in S3 between runs. Lambda becomes a pure incremental worker — read state, read delta, merge, write state. Stateless code, stateful job.

The numbers, after

Once the changes were deployed, the durations dropped off a cliff — in the good direction this time.

calculateSubscriptionStats — from 900,000 ms down to ~88,000 ms, every run:

CloudWatch durations after optimization — sub-second and ~1.5 min runs

calculatePartnerNotificationStats — from 900,000 ms down to under a second:

CloudWatch durations after optimization — sub-second and ~1 second runs

So:

  • calculateSubscriptionStats: 900,000 ms → ~88,000 ms. About a 10× speedup, and more importantly, no longer timing out.
  • calculatePartnerNotificationStats: 900,000 ms → ~700 ms. A ~1,300× speedup. From timing out to finishing before you finish reading this sentence.

calculateSubscriptionStats is still chunkier because it processes more partners with more per-partner logic. There's room to push it further (parallelizing per-partner work via Promise.all is the next obvious lever), but it now lives well inside its budget.
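
If I do pull that lever, the shape would be something like the following — purely a sketch, with partnerIds, processPartner, and mergePartnerStats as hypothetical stand-ins for the existing per-partner query-and-merge logic:

// Hypothetical fan-out: fetch and merge each partner's delta concurrently.
// An unbounded Promise.all can spike DynamoDB throughput, so a real version
// would probably cap concurrency (e.g. process partners in small batches).
const perPartner = await Promise.all(
  partnerIds.map((partnerId) => processPartner(partnerId, lastExecutionTime)),
);

// Fold the per-partner results into the previous snapshot before writing to S3
const newStats = perPartner.reduce(
  (acc, partnerStats) => mergePartnerStats(acc, partnerStats),
  prevStats,
);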

Lessons I'm taking with me

A few things I'd tell past-me before he started:

1. ProjectionExpression should be the default, not the optimization. If you don't need a field downstream, don't ask for it. The 1 MB page boundary makes this not a micro-optimization — it's a structural multiplier.

2. "Full scan on a schedule" is a time bomb. It runs fine until the day it doesn't, and that day is rarely the day you have spare hours to fix it. If a job recomputes history every run, ask whether it needs to.

3. Lambda memory ≠ application state. Anything stateful has to live somewhere durable between invocations. S3 is a wildly underrated state store for small-to-medium snapshots — cheap, atomic on single PUTs, and trivial to evolve.

4. Watch out for the double-fire. Now that the job merges deltas, an EventBridge misfire (or a manual re-run before S3 is updated) could double-count. We're idempotent-ish today via the lastExecutionTime cursor, but I want to harden that with an explicit lock or version check next — one possible shape is sketched after this list.

5. Reserved words will bite you the moment you reach for ProjectionExpression. Just alias everything via ExpressionAttributeNames from the start — future-you will thank present-you.
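
On lesson 4, the direction I'm leaning — strictly a sketch of an idea, not something that's shipped — is a conditional put on a small, hypothetical lock table, so a second concurrent run bails out instead of double-counting:

// Sketch only: a run-lock via a conditional put on a hypothetical Locks table.
// If another invocation already holds the lock, the ConditionExpression fails
// and this run exits without touching the S3 snapshot. The lock item would be
// deleted at the end of a successful run, or expire via the TTL attribute.
async function acquireRunLock(jobName) {
  try {
    await ddb.put({
      TableName: 'ObservabilityLocks',                   // hypothetical table
      Item: {
        pk: jobName,
        lockedAt: new Date().toISOString(),
        expiresAt: Math.floor(Date.now() / 1000) + 900,  // TTL attribute, 15 min
      },
      ConditionExpression: 'attribute_not_exists(pk)',
    }).promise();
    return true;
  } catch (err) {
    if (err.code === 'ConditionalCheckFailedException') return false;
    throw err;
  }
}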


If you've hit the same wall with DynamoDB-backed stats jobs, I'd love to hear what you tried. Especially if you've found a clean pattern for the double-fire idempotency problem — that's the next puzzle on my desk.
