Automated Custom Metrics in AWS Lambda Using Embedded Metric Format (EMF)

Your Lambda ran. Great. But did it actually do anything useful?


Why should you care about custom metrics?

Lambda publishes a set of metrics out of the box. The five most people watch are Invocations, Errors, Duration, Throttles, and ConcurrentExecutions. They're useful. I check them regularly. But they only tell you whether your function ran, not whether it did anything meaningful.

I learned this the hard way on a FinTech project. We had an SQS-triggered Lambda processing payment events. CloudWatch looked fine — invocations ticking, errors at zero, duration well within limits. But the queue depth was quietly climbing. By the time someone noticed, we had thousands of backed-up events and a very uncomfortable conversation with the client.

The Lambda was executing. It just wasn't processing payments.

When something goes wrong, stakeholders don't understand what "Lambda Duration p95 increased by 400ms" means. They do understand "EventFailed went from 0 to 847 in ten minutes." Custom business metrics give you a number both engineering and product can look at and agree on. They make on-call more actionable, post-mortems more honest, and dashboards useful to more than one person.

The three I now add to every event-driven Lambda:

  • EventReceived — the function was invoked with something to process
  • EventProcessed — it completed successfully end-to-end
  • EventFailed — something went wrong

The ratio between received and processed is the most important number in any pipeline. If they diverge, you have a problem — and you'll know before the queue backs up.


What native Lambda metrics can and can't tell you

Here's what you actually get.

Invocations — how many times the function was triggered. Traffic volume, nothing else.

Errors — unhandled exceptions. If your code catches and swallows an exception, this metric never moves. Which is exactly what happened in the payment scenario above.

Duration — execution time in milliseconds. Useful for cost and performance. Useless for correctness.

Throttles — concurrency limit hits. An infrastructure concern, not an application concern.

ConcurrentExecutions — parallel instances running. Infrastructure again.

All five can be healthy while your system is silently broken. A Lambda that catches every exception, logs it, and returns successfully won't trip a single one of these — even if every invocation failed its actual purpose.
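
To make that concrete, here's a minimal sketch of a handler that would keep all five metrics green while doing nothing useful. The SqsLikeEvent type and the chargePayment function are hypothetical stand-ins invented for this example, not code from the project described above.

```typescript
// Hypothetical, trimmed-down event shape; a real SQS event has more fields.
interface SqsLikeRecord { messageId: string; body: string }
interface SqsLikeEvent { Records: SqsLikeRecord[] }

// Hypothetical downstream call that always fails, standing in for a broken dependency.
async function chargePayment(_body: string): Promise<void> {
  throw new Error("payment provider unavailable");
}

// Every record fails, every failure is caught, and the invocation still
// resolves normally, so Lambda records it as a successful invocation.
async function swallowingHandler(
  event: SqsLikeEvent
): Promise<{ processed: number; failed: number }> {
  let processed = 0;
  let failed = 0;
  for (const record of event.Records) {
    try {
      await chargePayment(record.body);
      processed += 1;
    } catch (err) {
      // Logged and swallowed: the Errors metric never sees this.
      console.error("payment failed", record.messageId, (err as Error).message);
      failed += 1;
    }
  }
  return { processed, failed };
}
```

The only place the failure shows up is in the local counters and the log text, which is exactly the information a custom metric would surface.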

The metrics tell you the machine is running. They don't tell you the machine is producing anything worth having.


The old approach: PutMetricData

The traditional way to publish custom metrics from Lambda is PutMetricData — an explicit API call to CloudWatch. I've shipped this. It works. But the trade-offs pile up fast in Lambda specifically.

It adds latency. Every PutMetricData request is a synchronous HTTPS call to CloudWatch. In a Lambda that processes SQS records, that's at least one extra network call per invocation, or one per record if you're not batching carefully.

It needs extra IAM permissions. Your execution role needs cloudwatch:PutMetricData. Small thing individually, but it adds up — more surface area to audit and manage across environments.

It needs retry logic. CloudWatch will occasionally throttle or return errors. Without retry handling, your metric silently drops. With it, you're adding failure-handling code to a handler that should be focused elsewhere.

It's another outbound dependency. Your Lambda probably already calls a database, a queue, or a third-party API. Every additional dependency is another thing that can fail or slow you down.
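
To give a sense of the retry burden, here's a minimal sketch of the wrapper you end up writing. The publish function is a hypothetical stand-in for whatever performs the real PutMetricData call, and the attempt count and delays are illustrative assumptions, not recommended values.

```typescript
// Hypothetical shape of one datapoint; this is not the AWS SDK type.
interface MetricDatum { name: string; value: number; unit: string }

// `publish` stands in for whatever performs the real PutMetricData call.
type Publish = (data: MetricDatum[]) => Promise<void>;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry with exponential backoff. Every attempt and every sleep is time the
// handler spends on metrics instead of on the event it was invoked for.
async function publishWithRetry(
  publish: Publish,
  data: MetricDatum[],
  maxAttempts = 3,
  baseDelayMs = 50
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await publish(data);
      return attempt; // how many attempts it took
    } catch (err) {
      if (attempt === maxAttempts) throw err; // give up: the datapoint is lost
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw new Error("maxAttempts must be at least 1");
}
```

None of this code has anything to do with processing the event, which is the point: it's overhead the handler carries just to report on itself.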


Introducing AWS Embedded Metric Format (EMF)

What is it?

EMF is a specification that lets you embed CloudWatch metric definitions directly inside your log lines. Instead of a separate API call, you write a specially structured JSON object to stdout. In Lambda there is no agent to run: CloudWatch Logs recognizes the structure at ingestion, extracts the metrics, and publishes them to CloudWatch Metrics asynchronously.

No extra API call. No extra IAM permission beyond the logging permissions the function already needs. No network round trip added to your handler.

AWS built EMF specifically for serverless and container workloads where PutMetricData is an awkward fit. The spec is public and versioned, the library is maintained by AWS, and the behavior is predictable: write the JSON, CloudWatch handles the rest.

Why does it fit Lambda so well?

Lambda already ships logs to CloudWatch Logs. Every console.log goes there. EMF just adds a known structure to some of those lines — structure that CloudWatch knows how to parse into metrics. No new pipeline, no new permissions, no new failure mode.


How does EMF actually work?

The raw payload

Here's what an EMF log line looks like on the wire:

{
  "_aws": {
    "Timestamp": 1718000000000,
    "CloudWatchMetrics": [
      {
        "Namespace": "OrderService/Events",
        "Dimensions": [["Environment", "Service"]],
        "Metrics": [
          { "Name": "EventReceived", "Unit": "Count" },
          { "Name": "EventProcessingDuration", "Unit": "Milliseconds" }
        ]
      }
    ]
  },
  "Environment": "production",
  "Service": "order-processor",
  "EventSource": "sqs",
  "MessageId": "msg-4f8a92b1",
  "EventReceived": 1,
  "EventProcessingDuration": 87
}

_aws is the EMF envelope. Without it, this is just a regular log entry. CloudWatch Logs scans every line for this key.

Timestamp is epoch milliseconds. CloudWatch uses this as the metric timestamp — not the log ingestion time. This matters for accurate time-series charts and alarm evaluation.

CloudWatchMetrics defines what gets extracted. Each entry has a namespace, a dimension set, and the names of the metric fields in this log line.

Namespace is where the metric lands in CloudWatch. I use ServiceName/Domain — e.g. OrderService/Events, PaymentService/Transactions. Naming by bounded context rather than by function name means your metrics survive refactors and team reorganisations.

Dimensions is a double array — [["Environment", "Service"]]. Each inner array is a dimension set: the combination of keys CloudWatch uses to create a distinct time series. You can define multiple sets in a single payload if you want to aggregate the same metric differently.

Everything else at the top level is a dimension value (if its key appears in Dimensions, like Environment and Service), a metric value (if its key appears in Metrics), or a property (EventSource and MessageId here). Properties stay in the log but don't become CloudWatch metrics. The distinction between dimensions and properties matters a lot; we'll cover it in the next section.
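
If it helps to see how the pieces relate, here's a hand-rolled sketch that assembles the same shape without the library. The names buildEmfPayload, EmfMetric, and EmfUnit are invented for this example, and the unit list is deliberately shortened. In real code you'd let aws-embedded-metrics do this for you.

```typescript
type EmfUnit = "Count" | "Milliseconds" | "None"; // shortened on purpose

interface EmfMetric { name: string; value: number; unit: EmfUnit }

// Builds one EMF log object. Dimension values, metric values, and properties
// are all plain top-level fields; only `_aws` says which ones are which.
function buildEmfPayload(
  namespace: string,
  dimensions: Record<string, string>,
  metrics: EmfMetric[],
  properties: Record<string, string | number> = {},
  timestamp: number = Date.now()
): Record<string, unknown> {
  const payload: Record<string, unknown> = {
    _aws: {
      Timestamp: timestamp, // epoch milliseconds
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions)], // a single dimension set
          Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
        },
      ],
    },
    ...properties, // kept in the log only
    ...dimensions, // referenced by key from Dimensions
  };
  for (const m of metrics) payload[m.name] = m.value; // referenced from Metrics
  return payload;
}

const sample = buildEmfPayload(
  "OrderService/Events",
  { Environment: "production", Service: "order-processor" },
  [{ name: "EventReceived", value: 1, unit: "Count" }],
  { EventSource: "sqs", MessageId: "msg-4f8a92b1" }
);

// One JSON object per line on stdout is what ends up in CloudWatch Logs.
console.log(JSON.stringify(sample));
```

Notice that MessageId and EventSource are written exactly like the dimension values. What makes them properties is simply that nothing inside _aws refers to them.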

EMF Flow Overview

The full lifecycle: Lambda writes the JSON to stdout → CloudWatch Logs ingests it → EMF processor detects the _aws envelope → metrics are extracted and published to CloudWatch Metrics asynchronously, usually within about a minute.

The original log line is always kept. You get the CloudWatch metric and the full JSON context in Logs Insights. That's what makes EMF genuinely useful for debugging — MessageId and EventSource are right there, even though they're not dimensions.
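
As a mental model for the extraction step, here's a rough sketch of what "detect the envelope and pull out the metrics" could look like. This is my own approximation for illustration, not CloudWatch's implementation; extractMetrics and ExtractedMetric are invented names.

```typescript
interface ExtractedMetric {
  namespace: string;
  name: string;
  unit: string;
  value: number;
  dimensions: Record<string, string>;
}

// Conceptual model only: turns one log line into zero or more datapoints.
function extractMetrics(logLine: string): ExtractedMetric[] {
  let entry: any; // untyped on purpose: log lines can contain anything
  try {
    entry = JSON.parse(logLine);
  } catch {
    return []; // not JSON, so just a regular log line
  }
  const directives = entry?._aws?.CloudWatchMetrics;
  if (!Array.isArray(directives)) return []; // no envelope, regular log line

  const out: ExtractedMetric[] = [];
  for (const directive of directives) {
    // Each dimension set yields its own series for every listed metric.
    for (const dimensionSet of directive.Dimensions ?? [[]]) {
      const dimensions: Record<string, string> = {};
      for (const key of dimensionSet) dimensions[key] = String(entry[key]);
      for (const metric of directive.Metrics ?? []) {
        const value = entry[metric.Name];
        if (typeof value !== "number") continue; // skip non-numeric values
        out.push({
          namespace: directive.Namespace,
          name: metric.Name,
          unit: metric.Unit ?? "None",
          value,
          dimensions,
        });
      }
    }
  }
  return out;
}
```

Fields that nothing in the envelope refers to never appear in the output, which mirrors why properties stay searchable in the log without becoming metrics.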


Dimensions vs Properties — getting this right

Get this wrong and you'll have a CloudWatch bill problem before you have a useful metric.

Dimensions define the identity of a metric time series. Each unique combination of dimension values creates a separate metric in CloudWatch. They're what you filter and group by in dashboards and alarms.

Properties live only in the log. They add debugging context you can query with Logs Insights, but they don't affect your metric structure at all.

A practical rule: if you'd want to set a CloudWatch alarm that filters on this field, it's a dimension. If you'd only ever look for it during a specific incident, it's a property.

Good dimensions — low cardinality, meaningful aggregation:

metrics.putDimensions({
  Environment: "production",    // ~3-5 values
  Service: "order-processor",   // bounded set of services
  EventSource: "sqs",           // sqs | sns | eventbridge | api-gateway
});

Properties — high cardinality, debugging context:

metrics.setProperty("MessageId", record.messageId);         // unbounded
metrics.setProperty("CorrelationId", event.correlationId);  // unbounded
metrics.setProperty("CustomerId", order.customerId);        // unbounded
metrics.setProperty("ErrorMessage", err.message);           // variable

Never dimension on IDs, timestamps, or anything with more than a few dozen distinct values. The cost of getting this wrong is a metric explosion that shows up on your CloudWatch bill before it shows up as a useful alert.
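
A quick back-of-the-envelope sketch shows why. For one metric name under one dimension set, the upper bound on distinct time series is the product of how many values each dimension can take. The counts below are made-up illustrations, and estimateSeriesCount is a name invented for this example.

```typescript
// Upper bound on distinct time series for ONE metric name under ONE
// dimension set: the product of the distinct values of each dimension.
function estimateSeriesCount(
  distinctValuesPerDimension: Record<string, number>
): number {
  return Object.values(distinctValuesPerDimension).reduce(
    (product, count) => product * count,
    1
  );
}

// Illustrative numbers only.
const bounded = estimateSeriesCount({
  Environment: 3,
  Service: 10,
  EventSource: 4,
}); // 3 * 10 * 4 = 120

const withCustomerId = estimateSeriesCount({
  Environment: 3,
  Service: 10,
  EventSource: 4,
  CustomerId: 50_000,
}); // 120 * 50,000 = 6,000,000

console.log({ bounded, withCustomerId });
```

Multiply that again by the number of metric names you emit, and one careless dimension turns a few hundred series into millions.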


Implementing EMF in AWS Lambda

Installation

npm install aws-embedded-metrics

The official aws-embedded-metrics library is maintained by AWS. It handles the JSON construction, stdout writing, and flush lifecycle — it's the one you want.

metricScope vs createMetricsLogger — which to use?

The library gives you two patterns. metricScope is a decorator that wraps your handler and flushes automatically at the end of each invocation:

import { metricScope, Unit } from "aws-embedded-metrics";
import { SQSHandler } from "aws-lambda";

export const handler: SQSHandler = metricScope(
  (metrics) => async (event) => {
    for (const record of event.Records) {
      await processRecord(record, metrics);
    }
  }
);

It's clean, but it emits a single aggregated flush for the entire invocation. If you're processing a batch of SQS records and want one metric emission per record — which gives you per-message granularity in Logs Insights — use createMetricsLogger instead:

import { createMetricsLogger, Unit } from "aws-embedded-metrics";
import { SQSHandler, SQSRecord } from "aws-lambda";

export const handler: SQSHandler = async (event) => {
  for (const record of event.Records) {
    await processRecord(record);
  }
};

async function processRecord(record: SQSRecord): Promise<void> {
  const metrics = createMetricsLogger();
  const startTime = Date.now();

  metrics.setNamespace("OrderService/Events");
  metrics.putDimensions({
    Environment: process.env.ENVIRONMENT ?? "unknown",
    Service: "order-processor",
    EventSource: "sqs",
  });

  // Properties — in the log, not in CloudWatch Metrics
  metrics.setProperty("MessageId", record.messageId);
  metrics.setProperty("FunctionVersion", process.env.AWS_LAMBDA_FUNCTION_VERSION ?? "unknown");

  // Emit before try/catch — this is your unconditional baseline
  metrics.putMetric("EventReceived", 1, Unit.Count);

  try {
    const payload = JSON.parse(record.body);
    metrics.setProperty("OrderId", payload.orderId);

    await fulfillOrder(payload);

    metrics.putMetric("EventProcessed", 1, Unit.Count);
    metrics.putMetric("EventProcessingDuration", Date.now() - startTime, Unit.Milliseconds);
  } catch (error) {
    const err = error as Error;

    metrics.setProperty("ErrorType", err.constructor.name);
    metrics.setProperty("ErrorMessage", err.message);

    metrics.putMetric("EventFailed", 1, Unit.Count);
    metrics.putMetric("EventProcessingDuration", Date.now() - startTime, Unit.Milliseconds);

    // Re-throw so SQS can retry or route to the DLQ
    throw error;
  } finally {
    // Always flush — even on the error path
    await metrics.flush();
  }
}

async function fulfillOrder(payload: unknown): Promise<void> {
  // Business logic
}

EventReceived is emitted before the try block — intentionally. If an exception fires before you reach a metric inside the try, you lose the count with no indication anything happened. Emit it first, unconditionally. It's your baseline and it should always be there.

EventProcessingDuration is emitted on both the success and failure paths. That's not redundant — it's two different signals. Failures that are consistently fast are usually validation errors caught early. Failures that are consistently slow are usually timeouts or overloaded downstream services. You can't tell the difference if you only track one.

The finally block is non-negotiable. With createMetricsLogger, a missing flush is a silent metric drop — no exception, no warning, the metric just isn't there. I've been caught by this on a late-night deploy. Once you move the flush into finally, it's impossible to forget.
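
One way to make the flush structurally hard to skip is a small wrapper that owns the try/finally. This is a sketch under assumptions: withFlush and Flushable are names I made up, and the only thing it relies on is that the logger exposes an async flush() method, which the logger returned by createMetricsLogger does.

```typescript
// Minimal structural type: anything with an async flush() fits.
interface Flushable {
  flush(): Promise<void>;
}

// Runs `work` and flushes afterwards, whether `work` resolves or throws.
async function withFlush<L extends Flushable, T>(
  logger: L,
  work: (logger: L) => Promise<T>
): Promise<T> {
  try {
    return await work(logger);
  } finally {
    await logger.flush(); // runs on both the success and the error path
  }
}

// Intended usage (not executed here, since it needs the library):
//   await withFlush(createMetricsLogger(), async (metrics) => {
//     metrics.putMetric("EventReceived", 1, Unit.Count);
//     // ...process the record...
//   });
```

The error still propagates to the caller, so SQS retry behaviour is unchanged. The wrapper only guarantees that whatever was recorded before the throw gets written.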


A more complete example

Here's how I'd structure this across a real pipeline, with a shared helper so every Lambda in the domain gets the same baseline by default.

// shared/metrics-helper.ts
import { MetricsLogger, Unit } from "aws-embedded-metrics";

export interface EventMetricsContext {
  messageId: string;
  eventSource: string;
  service: string;
}

export function initEventMetrics(
  metrics: MetricsLogger,
  namespace: string,
  ctx: EventMetricsContext
): void {
  metrics.setNamespace(namespace);
  metrics.putDimensions({
    Environment: process.env.ENVIRONMENT ?? "unknown",
    Service: ctx.service,
    EventSource: ctx.eventSource,
  });
  metrics.setProperty("MessageId", ctx.messageId);
  metrics.setProperty("FunctionName", process.env.AWS_LAMBDA_FUNCTION_NAME ?? "unknown");
  metrics.setProperty("FunctionVersion", process.env.AWS_LAMBDA_FUNCTION_VERSION ?? "unknown");
}

export { Unit };

// functions/process-order/handler.ts
import { createMetricsLogger, Unit } from "aws-embedded-metrics";
import { SQSHandler, SQSRecord } from "aws-lambda";
import { initEventMetrics } from "../../shared/metrics-helper";

interface OrderEvent {
  orderId: string;
  customerId: string;
  items: Array<{ productId: string; quantity: number; unitPrice: number }>;
}

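// Note: records are processed sequentially with for-of. Because processOrderRecord
// re-throws, the first failure fails the invocation and SQS retries the whole batch,
// including records that already succeeded, unless partial batch responses
// (ReportBatchItemFailures) are enabled. Promise.all would not change that.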
export const handler: SQSHandler = async (event) => {
  for (const record of event.Records) {
    await processOrderRecord(record);
  }
};

async function processOrderRecord(record: SQSRecord): Promise<void> {
  const metrics = createMetricsLogger();
  const startTime = Date.now();

  // Base dimensions — shared across success and failure
  const baseDimensions = {
    Environment: process.env.ENVIRONMENT ?? "unknown",
    Service: "order-processor",
    EventSource: "sqs",
  };

  initEventMetrics(metrics, "OrderService/Events", {
    messageId: record.messageId,
    eventSource: "sqs",
    service: "order-processor",
  });

  metrics.putMetric("EventReceived", 1, Unit.Count);

  try {
    const orderEvent: OrderEvent = JSON.parse(record.body);

    metrics.setProperty("OrderId", orderEvent.orderId);
    metrics.setProperty("CustomerId", orderEvent.customerId);
    metrics.setProperty("ItemCount", orderEvent.items.length);

    const orderTotal = orderEvent.items.reduce(
      (sum, item) => sum + item.quantity * item.unitPrice,
      0
    );
    metrics.setProperty("OrderTotal", orderTotal);

    await validateAndFulfillOrder(orderEvent);

    metrics.putMetric("EventProcessed", 1, Unit.Count);
    metrics.putMetric("OrderFulfilled", 1, Unit.Count);
    // Unit.None is correct for monetary values — CloudWatch has no currency unit.
    // Sum and SampleCount still work; just be explicit in your dashboard labels.
    metrics.putMetric("OrderValue", orderTotal, Unit.None);
    metrics.putMetric("EventProcessingDuration", Date.now() - startTime, Unit.Milliseconds);
  } catch (error) {
    const err = error as Error;

    metrics.setProperty("FailureType", err.constructor.name);
    metrics.setProperty("FailureReason", err.message);

    // EventFailed uses the same base dimensions as EventProcessed so they
    // can be queried together (e.g. EventFailed / EventReceived for failure rate).
    metrics.putMetric("EventFailed", 1, Unit.Count);
    metrics.putMetric("EventProcessingDuration", Date.now() - startTime, Unit.Milliseconds);

    // Separate metric with FailureCategory dimension for breakdown dashboards.
    // This is intentionally a second flush with a different dimension set —
    // mixing it into the base dimensions would break aggregation with EventProcessed.
    const failureCategory = categoriseError(err);
    const categoryMetrics = createMetricsLogger();
    categoryMetrics.setNamespace("OrderService/Events");
    categoryMetrics.putDimensions({ ...baseDimensions, FailureCategory: failureCategory });
    categoryMetrics.putMetric("EventFailedByCategory", 1, Unit.Count);
    await categoryMetrics.flush();

    throw error;
  } finally {
    await metrics.flush();
  }
}

type FailureCategory =
  | "ValidationError"
  | "InventoryError"
  | "PaymentError"
  | "DownstreamTimeout"
  | "UnknownError";

function categoriseError(err: Error): FailureCategory {
  if (err.name === "ValidationError") return "ValidationError";
  if (err.name === "InsufficientInventoryError") return "InventoryError";
  if (err.name === "PaymentDeclinedError") return "PaymentError";
  if (err.name === "TimeoutError") return "DownstreamTimeout";
  return "UnknownError";
}

async function validateAndFulfillOrder(order: OrderEvent): Promise<void> {
  // Inventory check, payment charge, fulfilment dispatch
  void order;
}

The two-flush pattern on the failure path is worth understanding. EventFailed goes out under the base dimensions {Environment, Service, EventSource} — the same set as EventProcessed. This means you can write a CloudWatch Metric Math expression like EventFailed / EventReceived and get a meaningful failure rate without dimension mismatches. The EventFailedByCategory metric gets its own logger with FailureCategory as an additional dimension, purely for breakdown dashboards. Mixing both into a single dimension set would make EventProcessed and EventFailed incomparable in aggregation queries.
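
One related detail in the handler above: because each record re-throws on failure, a single bad record fails the invocation and the whole SQS batch becomes eligible for retry. If ReportBatchItemFailures is enabled on the event source mapping, a handler can instead return the IDs of the failed messages. Here's a minimal sketch of that shape, with local types standing in for the aws-lambda ones. handleBatch and processOne are names invented for this example.

```typescript
// Local stand-ins for the SQS record and batch response shapes.
interface SqsLikeRecord { messageId: string; body: string }
interface BatchResponse {
  batchItemFailures: { itemIdentifier: string }[];
}

// Processes records one by one and reports only the failures back to SQS.
// Assumes ReportBatchItemFailures is enabled on the event source mapping.
async function handleBatch(
  records: SqsLikeRecord[],
  processOne: (record: SqsLikeRecord) => Promise<void>
): Promise<BatchResponse> {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of records) {
    try {
      await processOne(record);
    } catch {
      // processOne is expected to have emitted EventFailed and flushed already.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```

The trade-off is worth noting: an invocation that returns failures this way still completes normally, so the built-in Errors metric stays flat and EventFailed becomes the only signal that something went wrong.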

With this in place, a CloudWatch dashboard for this pipeline shows:

  • EventReceived vs EventProcessed over time — is the pipeline moving?
  • EventFailed / EventReceived as a Metric Math expression — your failure rate
  • EventFailedByCategory broken down by FailureCategory — which failure mode dominates?
  • P50/P95/P99 of EventProcessingDuration — where is time going?
  • OrderFulfilled and OrderValue for the business side of the house

And the alarms that actually matter:

  • EventFailed / EventReceived exceeding 2% over a 5-minute window
  • Zero EventProcessed for 10 minutes — a dead pipeline alarm
  • EventProcessingDuration P95 breaching your SLA threshold
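
To pin down what the first two alarms evaluate, here's the arithmetic as a plain sketch over one-minute buckets. This is not CloudWatch alarm configuration, just the logic; MinuteBucket, failureRate, and isDeadPipeline are names invented for this example.

```typescript
// One bucket per minute, oldest first. Values are sums for that minute.
interface MinuteBucket { received: number; processed: number; failed: number }

// Failure rate over the last `windowMinutes` buckets.
// Returns null when nothing was received, since the ratio is undefined.
function failureRate(buckets: MinuteBucket[], windowMinutes = 5): number | null {
  const recent = buckets.slice(-windowMinutes);
  const received = recent.reduce((sum, b) => sum + b.received, 0);
  const failed = recent.reduce((sum, b) => sum + b.failed, 0);
  return received === 0 ? null : failed / received;
}

// True when there is a full window of data and nothing was processed in it.
function isDeadPipeline(buckets: MinuteBucket[], quietMinutes = 10): boolean {
  const recent = buckets.slice(-quietMinutes);
  return recent.length === quietMinutes && recent.every((b) => b.processed === 0);
}

// Illustrative data: 3 failures out of 100 received in the last 5 minutes.
const lastFive: MinuteBucket[] = Array.from({ length: 5 }, (_, i) => ({
  received: 20,
  processed: i < 3 ? 19 : 20,
  failed: i < 3 ? 1 : 0,
}));
const rate = failureRate(lastFive);
console.log(rate !== null && rate > 0.02 ? "failure-rate alarm" : "ok");
```

In a real alarm the quiet case usually shows up as missing datapoints rather than zeros, so how the alarm treats missing data matters as much as the threshold.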

Things I do now that I didn't do before

Most of these came from incidents, not documentation.

Always emit EventReceived before the try/catch. Not inside it — before it. An exception before your metric emit means you lose the count silently. That missing count is the thing that makes the EventReceived vs EventProcessed gap impossible to interpret — you don't know if the gap is failures or events that never registered at all.

Flush in finally, always. With createMetricsLogger, a missing flush drops your metrics with no error. I've been bitten by this on a late-night deploy — spent 20 minutes wondering why the metrics weren't appearing before finding the flush inside the try block instead of finally. The fix is one line, the discovery is not.

Emit duration on both paths. Success duration and failure duration are different signals. Fast failures are usually validation errors caught early. Slow failures are usually timeouts or an overloaded downstream. Track both separately and you can tell the difference at a glance.

Keep your observability code consistent across the domain. The pattern of setting namespace, dimensions, and common properties repeats across every handler. Extract it into a shared helper once — initEventMetrics or similar. Every new Lambda in the same domain picks it up by default, dimension keys stay consistent, and OrderService/Events doesn't drift into per-function naming every time someone adds a new handler.

Set a dead pipeline alarm. Zero EventProcessed for 10 minutes on a pipeline that should always be moving is one of the most valuable alarms I've ever configured. Most alert systems catch the system when it's noisy. This one catches it when it goes quiet — which is how the payment pipeline failure in my intro story went undetected for hours.

Test that your metrics actually emit. This one almost nobody does, and it's the gap that hurts most after a deployment. The library supports AWS_EMF_ENVIRONMENT=Local, which skips environment auto-detection and writes the EMF JSON to stdout rather than trying to reach a CloudWatch agent. Set that in your test environment, capture stdout, and assert on it:

// In your test setup
process.env.AWS_EMF_ENVIRONMENT = "Local";

// In your test
const logs: string[] = [];
jest.spyOn(process.stdout, "write").mockImplementation((chunk) => {
  logs.push(chunk.toString());
  return true;
});

await handler(mockSQSEvent, mockContext, jest.fn());

const emfLog = logs
  .flatMap((l) => { try { return [JSON.parse(l)]; } catch { return []; } })
  .find((l) => l._aws);

expect(emfLog.EventReceived).toBe(1);
expect(emfLog.EventProcessed).toBe(1);
expect(emfLog._aws.CloudWatchMetrics[0].Namespace).toBe("OrderService/Events");

This catches the two most common mistakes: a metric being emitted with the wrong name, and a flush not being called at all. Both are invisible until they hit production.


Quick comparison

| | EMF (aws-embedded-metrics) | PutMetricData |
|---|---|---|
| Extra API call | No | Yes |
| Extra IAM permission | No | Yes (cloudwatch:PutMetricData) |
| Latency impact | None | Yes: synchronous HTTPS call |
| Error handling | Malformed payloads dropped silently; valid payloads always delivered | Requires explicit retry logic for throttles and transient errors |
| Per-record granularity | Yes: one flush per record | Yes: one call per batch |
| Best for | Lambda and container workloads writing to CloudWatch Logs | Non-Lambda workloads or systems not using CloudWatch Logs |

One thing the table doesn't show: EMF doesn't give you free error handling. If your JSON payload is malformed — wrong field types, missing required keys — CloudWatch drops it silently with no indication. Valid payloads are always delivered; it's your responsibility to keep them valid. The library handles this as long as you use the API correctly, but it's not the same as PutMetricData retrying on throttles.
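
If you ever hand-build payloads, or want an extra guard in tests, a small pre-flight check can catch the obvious mistakes. This sketch covers only a handful of rules as I understand them from the EMF spec (numeric timestamp, string dimension values, numeric metric values) and is not a full validator. findEmfProblems is a name invented for this example.

```typescript
// Returns a list of human-readable problems; an empty list means the
// payload passed these (partial) checks.
function findEmfProblems(payload: Record<string, any>): string[] {
  const meta = payload._aws;
  if (typeof meta !== "object" || meta === null) {
    return ["missing _aws envelope"];
  }
  const problems: string[] = [];
  if (typeof meta.Timestamp !== "number") {
    problems.push("_aws.Timestamp must be a number (epoch milliseconds)");
  }
  if (!Array.isArray(meta.CloudWatchMetrics)) {
    return [...problems, "_aws.CloudWatchMetrics must be an array"];
  }
  for (const directive of meta.CloudWatchMetrics) {
    if (typeof directive.Namespace !== "string" || directive.Namespace.length === 0) {
      problems.push("Namespace must be a non-empty string");
    }
    // Every key named in a dimension set needs a string value at the top level.
    for (const dimensionSet of directive.Dimensions ?? []) {
      for (const key of dimensionSet) {
        if (typeof payload[key] !== "string") {
          problems.push(`dimension "${key}" needs a string value`);
        }
      }
    }
    // Every metric named in Metrics needs a number (or array of numbers).
    for (const metric of directive.Metrics ?? []) {
      const value = payload[metric.Name];
      const numeric =
        typeof value === "number" ||
        (Array.isArray(value) && value.every((v) => typeof v === "number"));
      if (!numeric) problems.push(`metric "${metric.Name}" needs a numeric value`);
    }
  }
  return problems;
}
```

Run against the parsed output in a unit test, a non-empty result tells you about a bad payload before CloudWatch quietly ignores it.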

PutMetricData has a legitimate place in architectures that don't write to CloudWatch Logs at all — long-running EC2 processes, on-prem agents, batch jobs outside Lambda. In Lambda, it's the harder path with no upside.


Personal thoughts

I was skeptical of EMF the first time I saw it. Writing metrics to stdout felt like a workaround, not a real solution. It took reading the spec to understand that the log stream was the point — not a limitation.

One thing worth knowing before you ship this: EMF metrics can lag up to a minute between log ingestion and appearance in CloudWatch Metrics. That's fine for dashboards and alerting on trends, but if you're thinking about using these metrics to trigger automated remediation with sub-minute reaction time, they won't get you there. For that use case, you'd still need PutMetricData or a different mechanism. For everything else — monitoring, alarms, post-incident analysis — the lag is a non-issue.

If you take one thing from this article: add EventReceived, EventProcessed, and EventFailed to your most critical Lambda this week. Set the dead pipeline alarm. It takes less than an hour, and the next time something goes wrong at midnight, you'll know exactly where to look.


Found this useful? Follow for more AWS and serverless content. Questions and production war stories welcome in the comments.

