Observability as Code (OaC) applies the discipline of Infrastructure as Code (IaC) to observability: you define monitors, dashboards, and alerts in code, version them, and deploy them alongside your application, which pays long-term dividends.
The Challenge with Manual Observability
Take AWS Lambda as an example. You can manually instrument a Lambda function once — configure logging to Datadog, create monitors, dashboards, and alerts. This may work initially, but problems begin to appear when:
- Someone deploys to a new environment
- A new developer joins the team
- A monitor is missed during setup
- Configurations go out of sync across environments
- Production and non-production environments drift over time
All of these situations require significant manual effort and increase operational risk.
How Observability as Code Solves This
Observability as Code solves these problems by keeping every monitor, metric tag, alert, and trace rule in version control alongside the application code.
When you deploy your stack, the entire observability layer is deployed automatically as part of the same workflow. This ensures consistency, repeatability, and governance across all environments.
Benefits of Observability as Code
- Consistency across environments
- Version-controlled observability configurations
- No configuration drift in monitors and alerts
- Reproducibility across multiple environments
- Peer review and governance capabilities
- Atomic deployments with the application
- Easier disaster recovery since observability lives in the codebase
- Reduced learning curve through self-documenting configurations
- Standardization that helps control operational costs
- Improved scalability as systems and teams grow
In this post, I will walk you through exactly how to build this for a Node.js Lambda API using the Serverless Framework and the Datadog plugin. We'll enable APM tracing, structure logs with trace correlation, and create custom monitors for error rate and latency. We'll also cover best practices, ways to control costs, and some lessons learned that you can apply in your own implementations.
Sample App
I developed a simple REST API, a trip cost estimator, deployed as a single Lambda function behind an AWS API Gateway HTTP API. It accepts days, people, and accommodation tier, and returns a cost breakdown.
POST /estimate
{ "days": 7, "people": 2, "accommodation": "mid-range" }
The Serverless Framework + Datadog Stack
The Datadog plugin for Serverless Framework (serverless-plugin-datadog) is the core of this setup. When you run sls deploy, the plugin:
- Attaches the Datadog Lambda Library layer — instruments your Node.js runtime for APM
- Attaches the Datadog Lambda Extension layer — a sidecar process that runs alongside your function and ships metrics, traces, and logs directly to Datadog without a separate Forwarder Lambda
- Injects all required environment variables automatically (DD_SITE, DD_SERVICE, DD_ENV, etc.)
- Creates or updates Datadog monitors defined in your serverless.yml
Example of a complete serverless.yml:
service: trip-estimator
useDotenv: true

provider:
  name: aws
  runtime: nodejs20.x
  region: us-east-1
  stage: dev
  logRetentionInDays: 1
  environment:
    DD_API_KEY: ${env:DD_API_KEY}
    DD_SITE: ${env:DD_SITE, 'us5.datadoghq.com'}
    DD_ENV: ${sls:stage}
    DD_SERVICE: trip-estimator
    DD_VERSION: "1.0.0"
    DD_LOGS_INJECTION: "true"

functions:
  tripCostEstimator:
    handler: handler.estimate
    events:
      - httpApi:
          path: /estimate
          method: POST

plugins:
  - serverless-plugin-datadog

custom:
  datadog:
    apiKey: ${env:DD_API_KEY}
    appKey: ${env:DD_APP_KEY}
    site: ${env:DD_SITE, 'us5.datadoghq.com'}
    env: ${sls:stage}
    service: trip-estimator
    version: "1.0.0"
    enableDDLogs: true
    enableDDTracing: true
    enableEnhancedMetrics: true
    captureLambdaPayload: true
    addLayers: true
    monitors:
      - lambda-high-error-rate:
          ...
      - lambda-high-p90-latency:
          ...
Credentials live in a .env file that is never committed to source control. For production, the better approach is to pull them from AWS Secrets Manager.
DD_API_KEY=your_api_key_here
DD_APP_KEY=your_app_key_here
DD_SITE=us5.datadoghq.com
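If you do move the keys into AWS Secrets Manager, the plugin can also read the API key from a secret directly through its apiKeySecretArn option instead of apiKey. A minimal sketch (the ARN below is a placeholder, and the function's execution role must be allowed secretsmanager:GetSecretValue on that secret):

custom:
  datadog:
    # placeholder ARN; point this at the secret holding your Datadog API key
    apiKeySecretArn: arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:datadog-api-key-AbCdEf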
Enabling APM: Traces
The plugin handles Lambda layer attachment automatically. Once the layers are in place, every Lambda invocation produces a trace in Datadog APM with no code changes required.
For custom spans — to trace specific business logic within a request — you require dd-trace directly in your handler. At Lambda runtime, dd-trace is provided by the Datadog Lambda Library layer in /opt/nodejs/node_modules, so no bundling is needed. For local development, add it as a devDependency:
{
  "devDependencies": {
    "dd-trace": "^5.x",
    "serverless-plugin-datadog": "^5.x"
  }
}
In the handler, require it with a graceful fallback so local testing still works without the layer:
let tracer;
try {
  tracer = require('dd-trace');
} catch {
  tracer = null;
}
Then wrap business-critical logic in a custom span:
let summary;
if (tracer) {
  summary = tracer.trace('trip.estimate', {
    tags: { days, people, accommodation }
  }, (span) => {
    const result = calculate(days, people, accommodation);
    span.setTag('grand_total', result.grand_total);
    span.setTag('per_person_total', result.per_person_total);
    return result;
  });
} else {
  summary = calculate(days, people, accommodation);
}
In Datadog APM you will now see:
- An auto-instrumented root span for the entire Lambda invocation
- A child span trip.estimate tagged with your business attributes
- End-to-end latency, error rate, and throughput per service
Enabling Logs with Trace Correlation
The Datadog Lambda Extension subscribes to the Lambda Telemetry API and forwards all function logs directly to Datadog. Set enableDDLogs: true in the plugin config and logs flow automatically.
The real power comes from trace correlation — embedding the active trace ID and span ID inside every log line. When you click a log in Datadog, you can jump directly to its trace, and vice versa.
Here is the structured logging helper we used:
const SERVICE = 'sri-lanka-trip-estimator';

function log(level, message, data = {}) {
  const entry = {
    level,
    message,
    service: SERVICE,
    env: process.env.DD_ENV || 'local',
    timestamp: new Date().toISOString(),
    ...data,
  };
  if (tracer) {
    const span = tracer.scope().active();
    if (span) {
      entry['dd.trace_id'] = span.context().toTraceId();
      entry['dd.span_id'] = span.context().toSpanId();
    }
  }
  console.log(JSON.stringify(entry));
}
Every invocation now emits structured JSON logs across the full request lifecycle:
log('info', 'Incoming request', { method, path, sourceIp });
log('debug', 'Validating input', { days, people, accommodation });
log('info', 'Input validated, calculating estimate', { days, people, accommodation });
log('warn', 'Validation failed: invalid days', { days }); // bad input
log('error', 'Failed to parse request body', { rawBody }); // parse error
log('info', 'Estimate calculated', { grand_total, per_person_total });
A log line in Datadog then looks like this (the shape comes from the helper above; the values are illustrative):
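{
  "level": "info",
  "message": "Estimate calculated",
  "service": "sri-lanka-trip-estimator",
  "env": "dev",
  "timestamp": "2024-05-14T09:21:47.113Z",
  "grand_total": 1540,
  "per_person_total": 770,
  "dd.trace_id": "1234567890123456789",
  "dd.span_id": "987654321987654321"
}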
The dd.trace_id field is what links this log to its trace in APM. In the Datadog UI, the correlation is automatic — no manual configuration required once the field is present.
Datadog Plugin Capabilities Summary
| Capability | Plugin Config | What You Get |
|---|---|---|
| APM Tracing | enableDDTracing: true | Auto-instrumented spans per invocation, custom spans via dd-trace |
| Log forwarding | enableDDLogs: true | Logs shipped directly to Datadog via the Lambda Extension |
| Enhanced metrics | enableEnhancedMetrics: true | aws.lambda.enhanced.* — invocations, errors, duration, cold starts, memory |
| Payload capture | captureLambdaPayload: true | Request and response bodies captured as span metadata |
| Monitors | monitors: [...] | Monitors created and updated in Datadog on every sls deploy |
| Tagging | env, service, version | Unified Service Tagging across logs, traces, and metrics |
The plugin uses two Lambda layers:
- Datadog-Node20-x — the Node.js tracer library
- Datadog-Extension — the agent sidecar for direct metric and log shipping
Monitors as Code
This is where observability as code really shines. Monitors are defined directly in serverless.yml under custom.datadog.monitors. The plugin calls the Datadog Monitors API during deployment to create or update them. Remove a monitor from the config and it is deleted from Datadog on the next deploy. The entire monitor lifecycle is driven by code.
You need the Datadog Application Key (appKey) in addition to the API key for monitor management.
Error Rate Monitor
Alerts when more than 5% of invocations result in an error over the last 5 minutes:
monitors:
  - lambda-high-error-rate:
      name: "High Error Rate — Sri Lanka Trip Estimator (${sls:stage})"
      type: "metric alert"
      query: >-
        sum(last_5m):
        sum:aws.lambda.enhanced.errors{service:sri-lanka-trip-estimator,env:${sls:stage}}.as_count()
        /
        sum:aws.lambda.enhanced.invocations{service:sri-lanka-trip-estimator,env:${sls:stage}}.as_count()
        > 0.05
      message: |
        Lambda error rate exceeded 5% for trip-estimator in {{env.name}}.
        - Traces: https://app.us5.datadoghq.com/apm/traces?service=sri-lanka-trip-estimator
        - Logs: https://app.us5.datadoghq.com/logs?query=service:sri-lanka-trip-estimator status:error
      options:
        thresholds:
          critical: 0.05
          warning: 0.02
        notify_no_data: false
        require_full_window: false
        evaluation_delay: 60
P90 Latency Monitor
Alerts when the 90th-percentile response time exceeds 1 second:
  - lambda-high-p90-latency:
      name: "High P90 Latency — Sri Lanka Trip Estimator (${sls:stage})"
      type: "metric alert"
      query: >-
        avg(last_10m):
        p90:aws.lambda.duration{service:sri-lanka-trip-estimator,env:${sls:stage}}
        > 1000
      message: |
        Lambda P90 latency exceeded 1000ms for sri-lanka-trip-estimator in {{env.name}}.
        Possible causes: cold starts, downstream bottleneck, memory pressure.
      options:
        thresholds:
          critical: 1000
          warning: 800
        notify_no_data: false
        require_full_window: false
        evaluation_delay: 60
Best Practices: The CloudWatch Log Question
Lambda always writes logs to CloudWatch. This is one of the most common questions when teams move to a dedicated observability platform like Datadog:
Can we stop logs from going to CloudWatch and have them only in Datadog?
The obvious approaches you might reach for do not work.
Deny on the Lambda execution role
The natural first instinct is to add a Deny statement to the IAM role that Lambda uses:
provider:
  iam:
    role:
      statements:
        - Effect: Deny
          Action:
            - logs:CreateLogGroup
            - logs:CreateLogStream
            - logs:PutLogEvents
          Resource: "*"
Result: it did not work. Logs continued flowing to CloudWatch unchanged.
Why: Lambda's logging subsystem is managed by the AWS Lambda platform itself — not by your function's execution role. The platform writes logs through internal AWS infrastructure that is completely outside the IAM evaluation chain for your execution role. The logs:* permissions in the execution role are only relevant if your application code directly calls the CloudWatch Logs API. Lambda's automatic log delivery bypasses them entirely.
CloudWatch Log Group resource policy
Resource policies apply at the resource level rather than the identity level, so the next attempt is to deny lambda.amazonaws.com directly on the log group:
resources:
  Resources:
    DenyLambdaCloudWatchWrites:
      Type: AWS::Logs::ResourcePolicy
      Properties:
        PolicyName: deny-lambda-cw-writes-dev
        PolicyDocument: >
          {
            "Version": "2012-10-17",
            "Statement": [{
              "Effect": "Deny",
              "Principal": { "Service": "lambda.amazonaws.com" },
              "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
              "Resource": "arn:aws:logs:us-east-1:ACCOUNT:log-group:/aws/lambda/FUNCTION:*"
            }]
          }
This did not work either. Logs still appeared in CloudWatch after every invocation.
Why: The Lambda platform does not present lambda.amazonaws.com as its principal when writing logs. It uses an internal AWS service mechanism that does not correspond to any addressable IAM principal. There is no resource policy you can write to block it.
So what is the workaround? You don't need to pay both AWS and Datadog to store the same logs.
The pragmatic answer: set log retention to 1 day.
provider:
  logRetentionInDays: 1
One line. Serverless Framework sets the retention policy on the CloudWatch log group during deployment. Logs land in CloudWatch as always, but are automatically purged after 24 hours. Datadog retains them for as long as your plan allows.
This is the pattern used in production by teams that run Datadog as their primary observability platform. CloudWatch becomes a 24-hour emergency buffer — useful if Datadog has an outage — and costs almost nothing at that retention window.
The Datadog Lambda Extension intercepts logs via the Lambda Telemetry API and ships them to Datadog in real time. CloudWatch receives the same logs as an automatic platform behaviour that cannot be disabled, but we set 1-day retention to minimise cost and keep CloudWatch as a short-term backup. All operational visibility lives in Datadog.
What we have now is an entire observability layer that is version controlled, peer reviewed, and deployed automatically with the application: a true Observability as Code setup.



