Observability as Code (OaC) applies the discipline of Infrastructure as Code (IaC) to observability: you define monitors, dashboards, and alerts in code, version them, and deploy them alongside your application, which pays long-term dividends.
The Challenge with Manual Observability
Take AWS Lambda as an example. You can manually instrument a Lambda function once — configure logging to Datadog, create monitors, dashboards, and alerts. This may work initially, but problems begin to appear when:
- Someone deploys to a new environment
- A new developer joins the team
- A monitor is missed during setup
- Configurations go out of sync across environments
- Production and non-production environments drift over time
All of these situations require significant manual effort and increase operational risk.
How Observability as Code Solves This
Observability as Code solves these problems by keeping every monitor, metric tag, alert, and trace rule in version control alongside the application code.
When you deploy your stack, the entire observability layer is deployed automatically as part of the same workflow. This ensures consistency, repeatability, and governance across all environments.
Benefits of Observability as Code
- Consistency across environments
- Version-controlled observability configurations
- No configuration drift in monitors and alerts
- Reproducibility across multiple environments
- Peer review and governance capabilities
- Atomic deployments with the application
- Easier disaster recovery since observability lives in the codebase
- Reduced learning curve through self-documenting configurations
- Standardization that helps control operational costs
- Improved scalability as systems and teams grow
In this post, I will walk you through exactly how to build this for a Node.js Lambda API using the Serverless Framework and the Datadog plugin. We'll enable APM tracing, structure logs with trace correlation, and create custom monitors for error rate and latency. We'll also cover best practices, ways to control costs, and some lessons learned that you can apply in your own implementations.
Sample App
I developed a simple REST API, a trip cost estimator, deployed as a single Lambda function behind an AWS API Gateway HTTP API. It accepts days, people, and accommodation tier, and returns a cost breakdown.
POST /estimate
{ "days": 7, "people": 2, "accommodation": "mid-range" }
The Serverless Framework + Datadog Stack
The Datadog plugin for Serverless Framework (serverless-plugin-datadog) is the core of this setup. When you run sls deploy, the plugin:
- Attaches the Datadog Lambda Library layer — instruments your Node.js runtime for APM
- Attaches the Datadog Lambda Extension layer — a sidecar process that runs alongside your function and ships metrics, traces, and logs directly to Datadog without a separate Forwarder Lambda
- Injects all required environment variables automatically (DD_SITE, DD_SERVICE, DD_ENV, etc.)
- Creates or updates Datadog monitors defined in your serverless.yml
Example of a complete serverless.yml:
service: trip-estimator
useDotenv: true

provider:
  name: aws
  runtime: nodejs20.x
  region: us-east-1
  stage: dev
  logRetentionInDays: 1
  environment:
    DD_API_KEY: ${env:DD_API_KEY}
    DD_SITE: ${env:DD_SITE, 'us5.datadoghq.com'}
    DD_ENV: ${sls:stage}
    DD_SERVICE: trip-estimator
    DD_VERSION: "1.0.0"
    DD_LOGS_INJECTION: "true"

functions:
  tripCostEstimator:
    handler: handler.estimate
    events:
      - httpApi:
          path: /estimate
          method: POST

plugins:
  - serverless-plugin-datadog

custom:
  datadog:
    apiKey: ${env:DD_API_KEY}
    appKey: ${env:DD_APP_KEY}
    site: ${env:DD_SITE, 'us5.datadoghq.com'}
    env: ${sls:stage}
    service: trip-estimator
    version: "1.0.0"
    enableDDLogs: true
    enableDDTracing: true
    enableEnhancedMetrics: true
    captureLambdaPayload: true
    addLayers: true
    monitors:
      - lambda-high-error-rate:
          ...
      - lambda-high-p90-latency:
          ...
Credentials live in a .env file that is never committed to source control. For production, the better approach is to pull them from AWS Secrets Manager.
DD_API_KEY=your_api_key_here
DD_APP_KEY=your_app_key_here
DD_SITE=us5.datadoghq.com
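If you do move the keys into AWS Secrets Manager, the plugin can also read the API key from a secret directly through its apiKeySecretArn option instead of apiKey. A minimal sketch (the ARN below is a placeholder, and the function's execution role must be allowed secretsmanager:GetSecretValue on that secret):

custom:
  datadog:
    # placeholder ARN; point this at the secret holding your Datadog API key
    apiKeySecretArn: arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:datadog-api-key-AbCdEf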
Enabling APM: Traces
The plugin handles Lambda layer attachment automatically. Once the layers are in place, every Lambda invocation produces a trace in Datadog APM with no code changes required.
For custom spans — to trace specific business logic within a request — you require dd-trace directly in your handler. At Lambda runtime, dd-trace is provided by the Datadog Lambda Library layer in /opt/nodejs/node_modules, so no bundling is needed. For local development, add it as a devDependency:
{
  "devDependencies": {
    "dd-trace": "^5.x",
    "serverless-plugin-datadog": "^5.x"
  }
}
In the handler, require it with a graceful fallback so local testing still works without the layer:
let tracer;
try {
  tracer = require('dd-trace');
} catch {
  tracer = null;
}
Then wrap business-critical logic in a custom span:
let summary;
if (tracer) {
  summary = tracer.trace('trip.estimate', {
    tags: { days, people, accommodation }
  }, (span) => {
    const result = calculate(days, people, accommodation);
    span.setTag('grand_total', result.grand_total);
    span.setTag('per_person_total', result.per_person_total);
    return result;
  });
} else {
  summary = calculate(days, people, accommodation);
}
In Datadog APM you will now see:
- An auto-instrumented root span for the entire Lambda invocation
- A child span trip.estimate tagged with your business attributes
- End-to-end latency, error rate, and throughput per service
Enabling Logs with Trace Correlation
The Datadog Lambda Extension subscribes to the Lambda Telemetry API and forwards all function logs directly to Datadog. Set enableDDLogs: true in the plugin config and logs flow automatically.
The real power comes from trace correlation — embedding the active trace ID and span ID inside every log line. When you click a log in Datadog, you can jump directly to its trace, and vice versa.
Here is the structured logging helper we used:
const SERVICE = 'sri-lanka-trip-estimator';

function log(level, message, data = {}) {
  const entry = {
    level,
    message,
    service: SERVICE,
    env: process.env.DD_ENV || 'local',
    timestamp: new Date().toISOString(),
    ...data,
  };
  if (tracer) {
    const span = tracer.scope().active();
    if (span) {
      entry['dd.trace_id'] = span.context().toTraceId();
      entry['dd.span_id'] = span.context().toSpanId();
    }
  }
  console.log(JSON.stringify(entry));
}
Every invocation now emits structured JSON logs across the full request lifecycle:
log('info', 'Incoming request', { method, path, sourceIp });
log('debug', 'Validating input', { days, people, accommodation });
log('info', 'Input validated, calculating estimate', { days, people, accommodation });
log('warn', 'Validation failed: invalid days', { days }); // bad input
log('error', 'Failed to parse request body', { rawBody }); // parse error
log('info', 'Estimate calculated', { grand_total, per_person_total });
A log line in Datadog then looks like this (the shape comes from the helper above; the values are illustrative):
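{
  "level": "info",
  "message": "Estimate calculated",
  "service": "sri-lanka-trip-estimator",
  "env": "dev",
  "timestamp": "2024-05-14T09:21:47.113Z",
  "grand_total": 1540,
  "per_person_total": 770,
  "dd.trace_id": "1234567890123456789",
  "dd.span_id": "987654321987654321"
}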
The dd.trace_id field is what links this log to its trace in APM. In the Datadog UI, the correlation is automatic — no manual configuration required once the field is present.
Datadog Plugin Capabilities Summary
| Capability | Plugin Config | What You Get |
|---|---|---|
| APM Tracing | enableDDTracing: true | Auto-instrumented spans per invocation, custom spans via dd-trace |
| Log forwarding | enableDDLogs: true | Logs shipped directly to Datadog via the Lambda Extension |
| Enhanced metrics | enableEnhancedMetrics: true | aws.lambda.enhanced.* — invocations, errors, duration, cold starts, memory |
| Payload capture | captureLambdaPayload: true | Request and response bodies captured as span metadata |
| Monitors | monitors: [...] | Monitors created and updated in Datadog on every sls deploy |
| Tagging | env, service, version | Unified Service Tagging across logs, traces, and metrics |
The plugin uses two Lambda layers:
- Datadog-Node20-x — the Node.js tracer library
- Datadog-Extension — the agent sidecar for direct metric and log shipping
Monitors as Code
This is where observability as code really shines. Monitors are defined directly in serverless.yml under custom.datadog.monitors. The plugin calls the Datadog Monitors API during deployment to create or update them. Remove a monitor from the config and it is deleted from Datadog on the next deploy. The entire monitor lifecycle is driven by code.
You need the Datadog Application Key (appKey) in addition to the API key for monitor management.
Error Rate Monitor
Alerts when more than 5% of invocations result in an error over the last 5 minutes:
monitors:
  - lambda-high-error-rate:
      name: "High Error Rate — Sri Lanka Trip Estimator (${sls:stage})"
      type: "metric alert"
      query: >-
        sum(last_5m):
        sum:aws.lambda.enhanced.errors{service:sri-lanka-trip-estimator,env:${sls:stage}}.as_count()
        /
        sum:aws.lambda.enhanced.invocations{service:sri-lanka-trip-estimator,env:${sls:stage}}.as_count()
        > 0.05
      message: |
        Lambda error rate exceeded 5% for trip-estimator in {{env.name}}.
        - Traces: https://app.us5.datadoghq.com/apm/traces?service=sri-lanka-trip-estimator
        - Logs: https://app.us5.datadoghq.com/logs?query=service:sri-lanka-trip-estimator status:error
      options:
        thresholds:
          critical: 0.05
          warning: 0.02
        notify_no_data: false
        require_full_window: false
        evaluation_delay: 60
P90 Latency Monitor
Alerts when the 90th-percentile response time exceeds 1 second:
  - lambda-high-p90-latency:
      name: "High P90 Latency — Sri Lanka Trip Estimator (${sls:stage})"
      type: "metric alert"
      query: >-
        avg(last_10m):
        p90:aws.lambda.duration{service:sri-lanka-trip-estimator,env:${sls:stage}}
        > 1000
      message: |
        Lambda P90 latency exceeded 1000ms for sri-lanka-trip-estimator in {{env.name}}.
        Possible causes: cold starts, downstream bottleneck, memory pressure.
      options:
        thresholds:
          critical: 1000
          warning: 800
        notify_no_data: false
        require_full_window: false
        evaluation_delay: 60
Best Practices: The CloudWatch Log Question
Lambda always writes logs to CloudWatch. This is one of the most common questions when teams move to a dedicated observability platform like Datadog:
Can we stop logs from going to CloudWatch and have them only in Datadog?
The obvious approaches you might reach for do not work.
Deny on the Lambda execution role
The natural first instinct is to add a Deny statement to the IAM role that Lambda uses:
provider:
  iam:
    role:
      statements:
        - Effect: Deny
          Action:
            - logs:CreateLogGroup
            - logs:CreateLogStream
            - logs:PutLogEvents
          Resource: "*"
Result: it did not work. Logs continued flowing to CloudWatch unchanged.
Why: Lambda's logging subsystem is managed by the AWS Lambda platform itself — not by your function's execution role. The platform writes logs through internal AWS infrastructure that is completely outside the IAM evaluation chain for your execution role. The logs:* permissions in the execution role are only relevant if your application code directly calls the CloudWatch Logs API. Lambda's automatic log delivery bypasses them entirely.
CloudWatch Log Group resource policy
Resource policies apply at the resource level rather than the identity level, so the next attempt is to deny lambda.amazonaws.com directly on the log group:
resources:
  Resources:
    DenyLambdaCloudWatchWrites:
      Type: AWS::Logs::ResourcePolicy
      Properties:
        PolicyName: deny-lambda-cw-writes-dev
        PolicyDocument: >
          {
            "Version": "2012-10-17",
            "Statement": [{
              "Effect": "Deny",
              "Principal": { "Service": "lambda.amazonaws.com" },
              "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
              "Resource": "arn:aws:logs:us-east-1:ACCOUNT:log-group:/aws/lambda/FUNCTION:*"
            }]
          }
This did not work either. Logs still appeared in CloudWatch after every invocation.
Why: The Lambda platform does not present lambda.amazonaws.com as its principal when writing logs. It uses an internal AWS service mechanism that does not correspond to any addressable IAM principal. There is no resource policy you can write to block it.
So what is the workaround? You don't need to pay both AWS and Datadog to store the same logs.
The pragmatic answer: set log retention to 1 day.
provider:
  logRetentionInDays: 1
One line. Serverless Framework sets the retention policy on the CloudWatch log group during deployment. Logs land in CloudWatch as always, but are automatically purged after 24 hours. Datadog retains them for as long as your plan allows.
This is the pattern used in production by teams that run Datadog as their primary observability platform. CloudWatch becomes a 24-hour emergency buffer — useful if Datadog has an outage — and costs almost nothing at that retention window.
The Datadog Lambda Extension intercepts logs via the Lambda Telemetry API and ships them to Datadog in real time. CloudWatch receives the same logs as an automatic platform behaviour that cannot be disabled, but we set 1-day retention to minimise cost and keep CloudWatch as a short-term backup. All operational visibility lives in Datadog.
What we have now is an entire observability layer that is version controlled, peer reviewed, and deployed automatically with the application: a true Observability as Code setup.



