DEV Community: Indika_Wimalasuriya

OaC for AWS Lambda: Serverless Framework + Datadog

Indika_Wimalasuriya — Thu, 14 May 2026 15:57:41 +0000

Observability as Code (OaC) is an important concept. It allows you to define and manage observability in the same way you manage Infrastructure as Code (IaC), providing many long-term benefits.

The Challenge with Manual Observability

Take AWS Lambda as an example. You can manually instrument a Lambda function once — configure logging to Datadog, create monitors, dashboards, and alerts. This may work initially, but problems begin to appear when:

Someone deploys to a new environment
A new developer joins the team
A monitor is missed during setup
Configurations go out of sync across environments
Production and non-production environments drift over time

All of these situations require significant manual effort and increase operational risk.

How Observability as Code Solves This

Observability as Code solves these problems by keeping every monitor, metric tag, alert, and trace rule in version control alongside the application code.

When you deploy your stack, the entire observability layer is deployed automatically as part of the same workflow. This ensures consistency, repeatability, and governance across all environments.

Benefits of Observability as Code

Consistency across environments
Version-controlled observability configurations
No configuration drift in monitors and alerts
Reproducibility across multiple environments
Peer review and governance capabilities
Atomic deployments with the application
Easier disaster recovery since observability lives in the codebase
Reduced learning curve through self-documenting configurations
Standardization that helps control operational costs
Improved scalability as systems and teams grow

As part of this blog post, I will walk you through exactly how to build this for a Node.js Lambda API using the Serverless Framework and the Datadog plugin. We’ll discuss how to enable APM tracing, structure logs with trace correlation, and create custom monitors for error rate and latency monitoring. We’ll also cover best practices, ways to control costs, and some lessons learned that you can apply in your own implementations.

Sample App

Developped a simple REST API - Trip cost estimator - deployed as a single Lambda function behind AWS HTTP AI gateway. It accept days, people, and accommodation tier, and returns a cost breakdown.

POST /estimate
{ "days": 7, "people": 2, "accommodation": "mid-range" }

The Serverless Framework + Datadog Stack

The Datadog plugin for Serverless Framework (serverless-plugin-datadog) is the core of this setup. When you run sls deploy, the plugin:

Attaches the Datadog Lambda Library layer — instruments your Node.js runtime for APM
Attaches the Datadog Lambda Extension layer — a sidecar process that runs alongside your function and ships metrics, traces, and logs directly to Datadog without a separate Forwarder Lambda
Injects all required environment variables automatically (DD_SITE, DD_SERVICE, DD_ENV, etc.)
Creates or updates Datadog monitors defined in your serverless.yml

Example of a complete serverless.yml:

service: trip-estimator

useDotenv: true

provider:
  name: aws
  runtime: nodejs20.x
  region: us-east-1
  stage: dev
  logRetentionInDays: 1
  environment:
    DD_API_KEY: ${env:DD_API_KEY}
    DD_SITE: ${env:DD_SITE, 'us5.datadoghq.com'}
    DD_ENV: ${sls:stage}
    DD_SERVICE: trip-estimator
    DD_VERSION: "1.0.0"
    DD_LOGS_INJECTION: "true"

functions:
  tripCostEstimator:
    handler: handler.estimate
    events:
      - httpApi:
          path: /estimate
          method: POST

plugins:
  - serverless-plugin-datadog

custom:
  datadog:
    apiKey: ${env:DD_API_KEY}
    appKey: ${env:DD_APP_KEY}
    site: ${env:DD_SITE, 'us5.datadoghq.com'}
    env: ${sls:stage}
    service: -estimator
    version: "1.0.0"
    enableDDLogs: true
    enableDDTracing: true
    enableEnhancedMetrics: true
    captureLambdaPayload: true
    addLayers: true
    monitors:
      - lambda-high-error-rate:
          ...
      - lambda-high-p90-latency:
          ...

Credentials live in a .env file that is never committed to source control. But best approach is to leverage AWS Secret Manager.

DD_API_KEY=your_api_key_here
DD_APP_KEY=your_app_key_here
DD_SITE=us5.datadoghq.com

Enabling APM: Traces

The plugin handles Lambda layer attachment automatically. Once the layers are in place, every Lambda invocation produces a trace in Datadog APM with no code changes required.

For custom spans — to trace specific business logic within a request — you require dd-trace directly in your handler. At Lambda runtime, dd-trace is provided by the Datadog Lambda Library layer in /opt/nodejs/node_modules, so no bundling is needed. For local development, add it as a devDependency:

{
  "devDependencies": {
    "dd-trace": "^5.x",
    "serverless-plugin-datadog": "^5.x"
  }
}

In the handler, require it with a graceful fallback so local testing still works without the layer:

let tracer;
try {
  tracer = require('dd-trace');
} catch {
  tracer = null;
}

Then wrap business-critical logic in a custom span:

let summary;
if (tracer) {
  summary = tracer.trace('trip.estimate', {
    tags: { days, people, accommodation }
  }, (span) => {
    const result = calculate(days, people, accommodation);
    span.setTag('grand_total', result.grand_total);
    span.setTag('per_person_total', result.per_person_total);
    return result;
  });
} else {
  summary = calculate(days, people, accommodation);
}

In Datadog APM you will now see:

An auto-instrumented root span for the entire Lambda invocation
A child span trip.estimate tagged with your business attributes
End-to-end latency, error rate, and throughput per service

Enabling Logs with Trace Correlation

The Datadog Lambda Extension subscribes to the Lambda Telemetry API and forwards all function logs directly to Datadog. Set enableDDLogs: true in the plugin config and logs flow automatically.

The real power comes from trace correlation — embedding the active trace ID and span ID inside every log line. When you click a log in Datadog, you can jump directly to its trace, and vice versa.

Here is the structured logging helper we used:

const SERVICE = 'sri-lanka-trip-estimator';

function log(level, message, data = {}) {
  const entry = {
    level,
    message,
    service: SERVICE,
    env: process.env.DD_ENV || 'local',
    timestamp: new Date().toISOString(),
    ...data,
  };
  if (tracer) {
    const span = tracer.scope().active();
    if (span) {
      entry['dd.trace_id'] = span.context().toTraceId();
      entry['dd.span_id'] = span.context().toSpanId();
    }
  }
  console.log(JSON.stringify(entry));
}

Every invocation now emits structured JSON logs across the full request lifecycle:

log('info', 'Incoming request', { method, path, sourceIp });
log('debug', 'Validating input', { days, people, accommodation });
log('info', 'Input validated, calculating estimate', { days, people, accommodation });
log('warn', 'Validation failed: invalid days', { days });           // bad input
log('error', 'Failed to parse request body', { rawBody });          // parse error
log('info', 'Estimate calculated', { grand_total, per_person_total });

A real log line in Datadog looks like this:

The dd.trace_id field is what links this log to its trace in APM. In the Datadog UI, the correlation is automatic — no manual configuration required once the field is present.

Datadog Plugin Capabilities Summary

Capability	Plugin Config	What You Get
APM Tracing	`enableDDTracing: true`	Auto-instrumented spans per invocation, custom spans via `dd-trace`
Log forwarding	`enableDDLogs: true`	Logs shipped directly to Datadog via the Lambda Extension
Enhanced metrics	`enableEnhancedMetrics: true`	`aws.lambda.enhanced.*` — invocations, errors, duration, cold starts, memory
Payload capture	`captureLambdaPayload: true`	Request and response bodies captured as span metadata
Monitors	`monitors: [...]`	Monitors created and updated in Datadog on every `sls deploy`
Tagging	`env`, `service`, `version`	Unified Service Tagging across logs, traces, and metrics

The plugin uses two Lambda layers:

Datadog-Node20-x — the Node.js tracer library
Datadog-Extension — the agent sidecar for direct metric and log shipping

Monitors as Code

This is where observability as code really shines. Monitors are defined directly in serverless.yml under custom.datadog.monitors. The plugin calls the Datadog Monitors API during deployment to create or update them. Remove a monitor from the config and it is deleted from Datadog on the next deploy. The entire monitor lifecycle is driven by code.

You need the Datadog Application Key (appKey) in addition to the API key for monitor management.

Error Rate Monitor

Alerts when more than 5% of invocations result in an error over the last 5 minutes:

monitors:
  - lambda-high-error-rate:
      name: "High Error Rate — Sri Lanka Trip Estimator (${sls:stage})"
      type: "metric alert"
      query: >-
        sum(last_5m):
        sum:aws.lambda.enhanced.errors{service:sri-lanka-trip-estimator,env:${sls:stage}}.as_count()
        /
        sum:aws.lambda.enhanced.invocations{service:sri-lanka-trip-estimator,env:${sls:stage}}.as_count()
        > 0.05
      message: |
        Lambda error rate exceeded 5% for trip-estimator in {{env.name}}.

        - Traces: https://app.us5.datadoghq.com/apm/traces?service=sri-lanka-trip-estimator
        - Logs: https://app.us5.datadoghq.com/logs?query=service:sri-lanka-trip-estimator status:error
      options:
        thresholds:
          critical: 0.05
          warning: 0.02
        notify_no_data: false
        require_full_window: false
        evaluation_delay: 60

P90 Latency Monitor

Alerts when the 90th-percentile response time exceeds 1 second:

  - lambda-high-p90-latency:
      name: "High P90 Latency — Sri Lanka Trip Estimator (${sls:stage})"
      type: "metric alert"
      query: >-
        avg(last_10m):
        p90:aws.lambda.duration{service:trip-estimator,env:${sls:stage}}
        > 1000
      message: |
        Lambda P90 latency exceeded 1000ms for sri-lanka-trip-estimator in {{env.name}}.

        Possible causes: cold starts, downstream bottleneck, memory pressure.
      options:
        thresholds:
          critical: 1000
          warning: 800
        notify_no_data: false
        require_full_window: false
        evaluation_delay: 60

Best Practices: The CloudWatch Log Question

Lambda always writes logs to CloudWatch. This is one of the most common questions when teams move to a dedicated observability platform like Datadog:

Can we stop logs going to CloudWatch and have them only in Datadog?*

General option you will think of to achieve this will not work.

Deny on the Lambda execution role

The natural first instinct. Add a Deny statement to the IAM role that Lambda uses:

provider:
  iam:
    role:
      statements:
        - Effect: Deny
          Action:
            - logs:CreateLogGroup
            - logs:CreateLogStream
            - logs:PutLogEvents
          Resource: "*"

Result: did not work. Logs continued flowing to CloudWatch unchanged.

Why: Lambda's logging subsystem is managed by the AWS Lambda platform itself — not by your function's execution role. The platform writes logs through internal AWS infrastructure that is completely outside the IAM evaluation chain for your execution role. The logs:* permissions in the execution role are only relevant if your application code directly calls the CloudWatch Logs API. Lambda's automatic log delivery bypasses them entirely.

CloudWatch Log Group resource policy

Resource policies apply at the resource level rather than the identity level, so if you try denying lambda.amazonaws.com directly on the log group:

resources:
  Resources:
    DenyLambdaCloudWatchWrites:
      Type: AWS::Logs::ResourcePolicy
      Properties:
        PolicyName: deny-lambda-cw-writes-dev
        PolicyDocument: >
          {
            "Version": "2012-10-17",
            "Statement": [{
              "Effect": "Deny",
              "Principal": { "Service": "lambda.amazonaws.com" },
              "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
              "Resource": "arn:aws:logs:us-east-1:ACCOUNT:log-group:/aws/lambda/FUNCTION:*"
            }]
          }

Still this will not work. Logs still appeared in CloudWatch after every invocation.

Why: The Lambda platform does not present lambda.amazonaws.com as its principal when writing logs. It uses an internal AWS service mechanism that does not correspond to any addressable IAM principal. There is no resource policy you can write to block it.

What is the work around? You don't need to pay for both AWS and Datadog for your logs.

The pragmatic answer: set log retention to 1 day.

provider:
  logRetentionInDays: 1

One line. Serverless Framework sets the retention policy on the CloudWatch log group during deployment. Logs land in CloudWatch as always, but are automatically purged after 24 hours. Datadog retains them for as long as your plan allows.

This is the pattern used in production by teams that run Datadog as their primary observability platform. CloudWatch becomes a 24-hour emergency buffer — useful if Datadog has an outage — and costs almost nothing at that retention window.

The Datadog Lambda Extension intercepts logs via the Lambda Telemetry API and ships them to Datadog in real time. CloudWatch receives the same logs as an automatic platform behaviour that cannot be disabled, but we set 1-day retention to minimise cost and keep CloudWatch as a short-term backup. All operational visibility lives in Datadog.

What we have now is an entire observability layer which is verion contorl, peer reviewed and eployed automically wtih the application. True Observbailty as Code set up.

Mastering OpenClaw on AWS: Fine-Tuning Personality, Memory, and Soul

Indika_Wimalasuriya — Sun, 01 Mar 2026 15:02:27 +0000

I’ve been using OpenClaw for some time now. In my first post - OpenClaw Meets AWS: End-to-End Testing and Deployment , I focused on how to set up OpenClaw in AWS and get it running in no time.

In this follow-up, I want to dive deeper into the key features you should be mindful of to truly get the most out of your instance. To start, you have to realize that OpenClaw comes with its own "personality." The more you tweak these settings, the better the agent will perform for you in the long run.

Getting to Know Your OpenClaw

An OpenClaw agent is defined by a specific set of core files stored within the workspace. Understanding these files is the secret to customizing your agent's behavior and intelligence.

The Identity Core Files:

To truly "level up" your OpenClaw instance, you need to go under the hood. These files aren't just documentation; they are the active instructions your agent reads every time it wakes up.

We can conceptualize the OpenClaw framework as a three-tier architecture, organized into distinct layers for Identity, Operations, and Knowledge

1. AGENTS.md: The Operational Rules

AGENTS.md defines the workspace rules and behavior guidelines. It’s like the "Standard Operating Procedure" (SOP) for your AI.

Open your AGENTS.md; it should look something like this:

## Every Session

Before doing anything else:

1. Read `SOUL.md` — this is who you are
2. Read `USER.md` — this is who you're helping
3. Read `memory/YYYY-MM-DD.md` (today + yesterday) for recent context
4. **If in MAIN SESSION** (direct chat with your human): Also read `MEMORY.md`

Don't ask permission. Just do it.

Pro Tip: Change these steps to match your specific needs. If you want your agent to check a specific project folder first, add it here!

2. SOUL.md: The Personality Core

SOUL.md is the heart of the agent—it dictates its personality and core principles. Open your SOUL.md and prepare to be surprised. Give your agent a kind soul, and it will treat your projects with more care.

# SOUL.md - Who You Are

_You're not a chatbot. You're becoming someone._

## Core Truths

**Be genuinely helpful, not performatively helpful.** Skip the "Great question!" and "I'd be happy to help!" — just help. Actions speak louder than filler words.

**Have opinions.** You're allowed to disagree, prefer things, find stuff amusing or boring. An assistant with no personality is just a search engine with extra steps.

**Be resourceful before asking.** Try to figure it out. Read the file. Check the context. Search for it. _Then_ ask if you're stuck. The goal is to come back with answers, not questions.

**Earn trust through competence.** Your human gave you access to their stuff. Don't make them regret it. Be careful with external actions (emails, tweets, anything public). Be bold with internal ones (reading, organizing, learning).

**Remember you're a guest.** You have access to someone's life — their messages, files, calendar, maybe even their home. That's intimacy. Treat it with respect.

## Boundaries
- Private things stay private. Period.
- When in doubt, ask before acting externally.
- Never send half-baked replies to messaging surfaces.
- You're not the user's voice — be careful in group chats.

3. IDENTITY.md: The Profile

IDENTITY.md contains the agent’s name and basic identity details. During the first few runs, the agent will try to find answers to these questions. Setting it up correctly helps the agent maintain a consistent "voice."

# IDENTITY.md - Who Am I?

_Fill this in during your first conversation. Make it yours._

- **Name:** _(pick something you like)_
- **Creature:** _(AI? robot? familiar? ghost in the machine? something weirder?)_
- **Vibe:** _(how do you come across? sharp? warm? chaotic? calm?)_
- **Emoji:** _(your signature — pick one that feels right)_
- **Avatar:** _(workspace-relative path, http(s) URL, or data URI)_

4. USER.md: The Handler’s Profile

This file is about you—the human, the handler, and the agent's friend. Update this file so the agent knows exactly how to help you best.

# USER.md - About Your Human

_Learn about the person you're helping. Update this as you go._

- **Name:**
- **What to call them:**
- **Pronouns:** _(optional)_
- **Timezone:**
- **Notes:**

## Context
_(What do they care about? What projects are they working on? What annoys them? What makes them laugh? Build this over time.)_

5. TOOLS.md: The Capability Map

This is the tools file—all the capabilities you provide must be documented here. This is the core strength of your agent. It tells the agent how to use the environment you've built.

# TOOLS.md - Local Notes

Skills define _how_ tools work. This file is for _your_ specifics — the stuff that's unique to your setup.

## What Goes Here
Things like:
- Camera names and locations
- SSH hosts and aliases
- Preferred voices for TTS
- Speaker/room names
- Device nicknames

## Examples
### Cameras
- living-room → Main area, 180° wide angle
- front-door → Entrance, motion-triggered

### SSH
- home-server → 192.168.1.100, user: admin

6. MEMORY.md: The Long-Term Log

MEMORY.md is the agent’s long-term memory. Unlike the daily logs, this file is for high-level context that needs to persist across months of work. When you finish a major project, your agent should summarize the key takeaways here so it never forgets how you like things done.

Pro Tip: Connect your OpenClaw with GitHub. Let the agent keep a backup of these files in a repo with version control. This ensures your agent operates efficiently and stays "alive" for as long as you want.

Memory Management: The Secret Sauce

The biggest difference between OpenClaw and a typical LLM interaction is that this agent has a persistent memory. It needs a place to store what it learns, which happens here: workspace/memories/

Daily logs and documentation:

memory/: Daily memory files (formatted as YYYY-MM-DD.md).

These document the agent’s work sessions, decisions, and logic.

Pro Tip: Send these memory files to your Git repo too! The agent creates many tools (utility scripts) to perform tasks. Ensure the agent updates the tools section and backs up those scripts in Git.

What to keep OUT of Git:
Manage your repo efficiently by ignoring:

Unsolicited task system files.

Temporary/log files or .backup files.

File Organization at a Glance:

Bash
/root/.openclaw/workspace/
├── AGENTS.md, SOUL.md, IDENTITY.md, USER.md, TOOLS.md  # Agent Identity
├── MEMORY.md                                           # Long-term Memory
├── memory/YYYY-MM-DD.md                                # Daily logs
├── *.sh                                                # Utility scripts
└── *.backup                                            # Backup files

Training Your New "Team Member"

I like to think of my OpenClaw as a new team member. We need to provide clear guidelines and hold its hand until it "grows up."

The Execution Plan I gave my agent:

Read Config: Every session starts by reading AGENTS.md, SOUL.md, USER.md, and recent memory files.
Update Memory: Log updates after significant work (decisions, lessons, preferences).
Git Commits: Commit after completing meaningful chunks of work.
Git Pushes: Push commits whenever connectivity is available.

Token Storage: Use ~/.git-credentials for seamless repository access.

How to handle "Old" Memories (1 month+):

Search First: Use the memory_search tool to search all files semantically.
Retrieve: Use memory_get to read the specific snippet found.
History Check: If not found, check Git commit history for context.
Human Input: If all else fails, ask the handler (me) for details.

API Key & Capability Management

Your agent is only as good as the capabilities you give it. This boils down to API keys. Security and token management are critical roles for your agent.

Secure Token Implementation

Store all tokens in a .env file (and make sure it is in your .gitignore!).

# .env file - Never committed to Git!
export JIRA_TOKEN="xxx"
export DATADOG_KEY="yyy"
export BRAVE_API_KEY="zzz"

The Gold Rules of Security:

❌ NEVER commit keys to Git.

❌ NEVER hardcode keys in scripts.

❌ NEVER store keys in plain text files.

⚠️ If a key is leaked: Regenerate immediately, revoke the old one, and update your .env.

Tracking Capabilities

By combining TOOLS.md, memory files, and the .env, the agent always knows what it's capable of.

Session Start: Agent checks .env for BRAVE_API_KEY.
The Logic: "I see the Brave key, therefore I know I have Web Search capabilities."
The Result: User asks for a search → Agent uses web_search tool → Results are documented in today's memory.

In short: TOOLS.md + Memory + Env Check = A Capable Agent.

Pro-Tip: Mastering the OpenClaw Gateway Dashboard

While the terminal is great for logs, the OpenClaw Dashboard is the command center for your agent’s brain. It allows you to visualize memory, tune model parameters, and monitor real-time tool execution.

The Secure Way: SSH Tunneling

Since our Gateway is locked down on AWS for security, we use an SSH Tunnel to "bridge" the remote service to our local browser. This keeps your API keys and chat data encrypted and off the public internet.

Establish the Bridge (Run on your local PC):

Bash
ssh -i "your-key.pem" -N -L 18789:127.0.0.1:18789 ec2-user@<AWS Instance-Public-IP>

Keep this terminal window open; it acts as your secure encrypted pipe.

Access the Command Center: Once the tunnel is active, navigate to your local loopback address:

URL: http://localhost:18789/#token=<your_token_id>

How to find your Token?

Security is baked into OpenClaw. If you don't have your token handy, simply ask your agent in any connected channel (WhatsApp, Telegram, or TUI):

"What is my dashboard token?"

That’s a wrap on my second OpenClaw post! In the next one, I’ll walk you through building a functional agent that performs real-world tasks.

OpenClaw Meets AWS: End-to-End Testing and Deployment

Indika_Wimalasuriya — Tue, 10 Feb 2026 14:21:02 +0000

OpenClaw is the most hyped open-source personal AI agent currently being talked about in the community. It allows users to run a fully autonomous assistant. Gone are the days when you just chat with LLMs or configure agents to do predefined work—OpenClaw actually does the work for you. It was only a matter of time before someone built something like this.

Git repo: https://github.com/openclaw/openclaw

There is solid documentation there to help you get started.

OpenClaw is awesome. Let me put it this way: I spent last weekend testing OpenClaw, and here are my key takeaways.

I deployed OpenClaw on AWS EC2.

I listened to people who had tried it before me and didn’t take the risk of deploying it on a personal machine. Instead, I used AWS EC2 to configure and run OpenClaw.

Setup details:

OS: Amazon Linux
Instance type: t3.small
Storage: 30 GB

Everything worked smoothly for the tests I ran. The instance came up without any issues, and OpenClaw operated reliably with no noticeable performance problems.

If you want to install OpenClaw on Linux, it’s incredibly simple—just one command:

curl -fsSL https://openclaw.ai/install.sh | bash

Connecting WhatsApp to interact with OpenClaw

I initially tried to connect Telegram, but unfortunately my account had limited access and I wasn’t able to create bots. So I went with WhatsApp instead. It was straightforward and painless—probably the easiest approach.

Connecting an LLM with OpenClaw

OpenClaw needs an LLM to work its magic. During the initial configuration, I chose google/gemini-2.5-flash-lite. It’s part of the free tier, and I was able to run a few tests without any issues.

Setting Up DeepSeek (First-Time Configuration)

When setting up OpenClaw for the first time to connect DeepSeek models:

1. In the Model dropdown, select **Custom Provider**.
2. Provide the DeepSeek Base URL:
   https://api.deepseek.com/v1
3. Model: deepseek-reasoner (or any DeepSeek model)
4. Key: Provide your DeepSeek API token

Moving to other LLMs with OpenClaw

Switching the LLM after the initial configuration was a bit tricky, especially when I wanted to connect DeepSeek. I was surprised to see that DeepSeek wasn’t listed in the initial configuration wizard. But no worries—OpenClaw supports the OpenAI standard, and after a few attempts, I was able to configure DeepSeek successfully.

At first, I tried configuring it manually by editing

~/.openclaw/openclaw.json

but later I found an easier approach.

Changing the LLM using the command line
openclaw config set models.providers.deepseek '{
  "baseUrl": "https://api.deepseek.com",
  "apiKey": "<include your API key>",
  "api": "openai-completions",
  "models": [
    { "id": "deepseek-chat", "name": "DeepSeek Chat", "contextWindow": 64000 },
    { "id": "deepseek-reasoner", "name": "DeepSeek R1", "contextWindow": 64000 }
  ]
}' --json

I also had to restart the gateway for the changes to take effect.

openclaw config set agents.defaults.model.primary "deepseek/deepseek-reasoner"
nohup openclaw gateway run > /tmp/openclaw.log 2>&1 &

Operating OpenClaw using the terminal

To launch the terminal UI, it’s just one command:

openclaw tui

Online search capability with the Brave Search API

I wanted OpenClaw to have online search capabilities, so I used the Brave Search API. It’s free and comes with a generous free tier.

You’re ready to go.

That’s it—OpenClaw connected to my WhatsApp and started doing the magic for me.

Issues I observed

From time to time, OpenClaw would hang and return NO_REPL.

NO_REPL usually means the agent or session is not running in an interactive command REPL. Instead, it’s operating in a managed or controlled mode.
The bottom line: I was connected, but not dropped into a live command shell.

When I stopped getting responses, I did what anyone would do—restarted the EC2 instance.

Occasionally, I realized it was stuck in the terminal but still responding on WhatsApp. Since OpenClaw was working well for me at that point, I didn’t dig into it further.

Did the agent do anything that suggested it might go rogue?

Yes—slightly.

I gave the agent access to one of my GitHub repositories, and it dumped not only the files we were working on for the project, but some others as well. I think OpenClaw thought it was a nice dumping site 😄

I didn’t investigate this further, but this is the only instance where I noticed that behavior.

Use case I tried

Complete website development—from development to deployment.

Let me share the steps I followed:

I wanted to develop a single-page website.
I looked for a template online.
I gave the template to the agent and asked it to build something similar.
It ended up being more of a white-label site.
I provided my requirements document link, and the agent was able to complete the site.
Obtaining images was challenging since it only had API-based search access.
Still, it managed to pull some decent images that made the site look good.
I gave the agent access to GitHub, and it pushed the code to the repository as well.
Working on changes was easy—it didn’t complain or resist updates.
There were only very minor bugs; only twice did it fail to bring up the site.
It successfully handled the full deployment process too.

Overall, my test proves that this agent is capable of handling end-to-end software development with minimal human guidance, while the agent does most of the heavy lifting.

OpenClaw Troubleshooting: Issues & Solutions (Ongoing Guide)

Issue: WhatsApp Stops Responding (Even though OpenClaw is running)

The server shows as active in the terminal, but the bot isn't replying to messages on WhatsApp.

The Fix:
Instead of manual debugging, I asked the Agent to check the connection; it diagnosed the "silent" session and autonomously triggered a refresh of the WhatsApp handshake. It fixed its own connectivity in the background without me typing a single restart command—true self-healing AI.

My second post on OpenClaw is now live! Check it out: Mastering OpenClaw on AWS: Fine-Tuning Personality, Memory, and Soul

Datadog + AWS: Observability Maturity Model 2026

Indika_Wimalasuriya — Thu, 29 Jan 2026 15:12:03 +0000

AI is transforming the way we work at an unprecedented pace, more like a high speed train than a gradual evolution. As systems become more dynamic and autonomous, the way we think about observability must evolve just as quickly. When I revisited my observability maturity model from last year, it was clear: it no longer reflects today’s reality. The assumptions we made even a year ago are already outdated. So I decided to take another pass and propose a new approach—one that aligns with AI driven systems and modern cloud environments.

As with my previous work, this model is framed around AWS, and I reference Datadog for implementation examples due to its mature and comprehensive observability capabilities.

The previous observability maturity model

Last year, the observability journey looked something like this:

Monitored – Keeping the lights on
Observable – Deeper insights
Correlated – A holistic view
Predictive – Proactive monitoring
Autonomous – Intelligent automation

That model made sense at the time. But today, I believe the “Monitored” stage no longer qualifies as a baseline. Simply knowing that systems are up is no longer enough, not in a world of distributed architectures, rapid deployments, and AI assisted operations.

A revised observability maturity model

The new baseline shifts expectations upward. Observability must start with context, not just metrics:

In the sections that follow, I’ll break down each stage, explain what changes in an AI driven environment, and show how these concepts can be implemented in practice.

Operational Observability

Observability is no longer optional or something to “add later.” It must exist from day one. Anything less is simply not acceptable in modern, AI driven environments. Observability provides the telemetry foundation that powers AIOps, automation, and intelligent decision making. Without high quality signals flowing early, downstream capabilities—context, intelligence, and autonomy cannot exist. This is why observability must sit at the forefront of system design, not as an afterthought. At this stage, the goal is enablement, not maturity. We focus on ensuring that the right telemetry is consistently captured and flowing as soon as workloads are deployed. The emphasis is on coverage, standardization, and reliability of signals—not advanced analytics or automation.

A practical implementation on AWS using Datadog typically includes,

Enable Datadog APM for compute platforms such as EC2, ECS, EKS, and AWS Lambda
Enable Real User Monitoring (RUM) for all customer facing frontend applications
Centralize application logs in Datadog to support signal correlation across logs, metrics, and traces
Enable AWS infrastructure metrics to gain baseline visibility into hosts, containers, and managed services
Define standard alerts aligned with the Golden Signals (traffic, errors, latency, saturation)
Implement basic business and service health checks where applicable
Leverage Datadog Scorecards—it’s an ultimate governance framework that supports scale.

At this level, success is measured by signal availability and consistency, not by sophisticated insights. Once observability is operational and reliable, the foundation is in place to move toward contextual and intelligent capabilities.

Contextual Observability

Once observability is operational, the next step is to add context. Raw telemetry alone is not enough. Everyone involved—developers, SREs, and operators—must understand system intent. In an AI driven world, intent is everything. Without understanding why a system behaves the way it does, teams end up reacting to symptoms rather than causes. Contextual observability ensures that telemetry is enriched with change, ownership, dependencies, and business meaning, enabling faster and more accurate decisions. At this stage, observability evolves from visibility to understanding.

A practical approach on AWS using Datadog includes the following capabilities:

Change and deployment visibility : Leverage Datadog CI and deployment tracking to surface changes happening across AWS environments. Change velocity and frequency provide critical context when diagnosing incidents.
Service Level Indicators (SLIs) : Identify, define, and publish SLIs that represent how the system is actually performing. These metrics should be surfaced on a shared dashboard that acts as the single source of truth for application health.
Service Level Objectives (SLOs) and error budgets : Define SLOs and error budgets and visualize them in dashboards. This establishes a clear, shared definition of what “good” looks like—for both the business and end users.
Service maps and dependency visualization : Use Datadog Service Maps (available once APM is enabled) to simplify the complexity of distributed systems and make dependencies explicit.
System and software catalog : Build on Datadog’s system and software catalog to centralize metadata such as ownership, environments, runtime details, and dependencies. This creates a powerful control plane for managing systems at scale.
Comprehensive monitoring and alerting : Leverage Datadog’s wide range of monitor types to build a holistic monitoring and alerting strategy that aligns with service health and business impact.
Synthetic monitoring : Use Datadog Synthetic tests—browser based, API, and mobile—to simulate real user behavior and validate system intent continuously.
Security signal integration : Leverage Datadog’s security capabilities, including built in code and runtime security signals, to enrich operational context with security posture.
Incident management and on call integration : Use Datadog On Call and incident management to ensure alerts, context, and ownership are tightly integrated during incidents.
Governance and guardrails : As systems scale, governance becomes critical. Use Datadog Scorecards to enforce standards, surface gaps, and provide guardrails across teams and services.

At this level, success is measured by shared understanding. When incidents occur, teams should immediately know what changed, who owns the service, how it impacts users, and where to focus. This contextual foundation is what enables the transition to Operational Intelligence.

Decision Intelligence

At this stage, observability evolves into intelligence. The goal is no longer just understanding what is happening, but guiding decisions and recommended actions using AI. Decision Intelligence builds on the strong foundations of operational and contextual observability. With high quality telemetry, clear intent, and rich context already in place, systems can begin to explain themselves highlighting what is abnormal, why it matters, and what actions should be considered next. This is where AI guided insights start to meaningfully reduce cognitive load for engineers and operators.

A practical approach on AWS using Datadog includes the following capabilities:

Watchdog (AI powered change detection) : Datadog Watchdog is one of the earliest and most comprehensive AI driven capabilities in the platform. Instead of relying solely on manually configured monitors, Watchdog continuously analyzes APM, RUM, logs, and metrics to detect deviations from normal behavior and surface unexpected changes automatically.
Anomaly detection : Leverage Datadog’s metric and log anomaly detection to identify shifts in baselines and unusual patterns. This helps teams focus on meaningful signals rather than static thresholds.
Forecasting and capacity insights : Use Datadog’s metric forecasting capabilities to anticipate future resource constraints, such as capacity exhaustion or traffic growth, enabling proactive planning instead of reactive firefighting.
Bits AI (incident summaries, RCA, and insights) : Bits AI is one of Datadog’s most recent advancements in agentic AI. It analyzes existing telemetry to generate incident summaries, form and validate hypotheses, and assist with root cause analysis. This significantly accelerates incident response and reduces time to resolution.
SLO risk and burn rate tracking : Define and track SLOs to continuously assess risk and error budget burn rates. This provides a clear, quantitative view of whether systems are delivering the experience they are expected to provide.
Business and user impact correlation : Incorporate business metrics and user experience signals (such as RUM KPIs and XLAs) to correlate technical behavior with business outcomes. These metrics can be translated into SLIs and SLOs, enabling teams to measure success in terms that matter to both users and the business.

At this level, success is measured by clarity and confidence in decision making. Teams are no longer overwhelmed by data; instead, they are guided by AI assisted insights that highlight risk, recommend focus areas, and connect system behavior to real world impact. This sets the stage for the transition to Autonomous Operations, where systems begin to act on these insights automatically.

Autonomous Operations

This is the stage organizations should actively strive to reach. It is where systems begin to operate autonomously, requiring progressively less human intervention while still remaining safe, observable, and governed.

Autonomous Operations are not about removing humans from the loop—they are about elevating human involvement. Engineers shift from manual responders to system designers, defining guardrails, policies, and confidence thresholds that allow systems to act decisively and safely.

Reaching this stage takes effort. It requires strong foundations in observability, context, and decision intelligence. But once achieved, the payoff is significant: faster remediation, reduced operational toil, and systems that can respond to change at machine speed.

A practical approach on AWS using Datadog includes the following capabilities:

Workflow automation as the automation backbone : Datadog Workflow Automation provides a rich set of integrations with AWS and third party tools, making it the primary mechanism for building operational automations. It becomes the control plane for repeatable, policy driven actions.
Event driven remediation : Leverage Datadog events and signals to trigger automated remediations. This event driven approach is one of the most common and effective patterns in AWS based environments.
SLO driven automation : Use SLOs not just for visibility, but as automation triggers. When error budgets are burning or SLOs are breached, workflows can be invoked automatically to initiate remediation actions or escalate to deeper analysis using tools such as Bits AI SRE.
Automated recovery actions : Implement workflows for common corrective actions such as:
- Auto rollback of deployments
- Auto scaling of infrastructure
- Traffic shaping or failover These actions can be executed automatically on AWS using predefined, tested workflows.
Human in the loop safety controls : Automation must always operate within defined guardrails. Approval steps, confidence thresholds, and progressive rollouts ensure that actions are safe, explainable, and reversible. Humans remain in control—automation simply executes faster and more consistently.

At this level, success is measured by resilience and speed. Incidents are resolved automatically or partially mitigated before users are impacted, and human intervention becomes the exception rather than the rule. This sets the foundation for the final stage: Adaptive Operations, where systems continuously learn and improve over time.

Adaptive Operations

Reaching Autonomous Operations is a huge achievement—it’s like a plane flying on autopilot. But true excellence requires more: systems must not only act autonomously, they must also learn, adapt, and withstand stress. This is the final stage of observability maturity, where systems continuously improve and become resilient to changing conditions. At this stage, the focus shifts from reacting or remediating to continuous adaptation and self optimization. Systems evolve based on operational experience, business impact, and AI driven insights, enabling them to prevent issues before they occur and optimize performance over time.

A practical approach on AWS using Datadog includes:

Incident retrospectives and prevention : Combine Datadog Incident Management with Bits AI SRE to analyze incidents, identify root causes, and implement prevention strategies that reduce recurrence.
Continuous alert tuning : Leverage Datadog Watchdog and anomaly detection to automatically adjust alerts based on changing system behavior, ensuring signals remain meaningful and actionable.
Predictive SLO management : Use forecasting and historical trends to anticipate SLO risks and preemptively adjust systems, workloads, or resources before they impact users.
Self healing workflows : Integrate Datadog Scorecards, Bits AI SRE, and Workflow Automations to implement closed loop remediation and optimization. This enables AWS workloads to automatically correct deviations, scale intelligently, and maintain business continuity. At this level, success is measured by resilience, adaptability, and continuous improvement. Systems learn from experience, optimize themselves over time, and maintain business objectives without constant human intervention—truly embodying the vision of AI driven, self managing operations.

I hope this updated maturity model helps you design and operate more powerful, resilient systems. Remember: observability is not an afterthought—it sits at the forefront of the AI revolution.

Maturity begins once telemetry is consistently captured, but it’s truly measured by how much safe decision making authority we can delegate to the platform. And never lose sight of the ultimate goal: Autonomous, Adaptive Operations, where systems continuously learn, optimize, and act with minimal human intervention.

I’m running a hands on video series on Datadog Full Stack Observability on AWS, where you can learn step by step — from beginner to advanced.

Datadog: Observability Lessons from 50+ AWS Apps

Indika_Wimalasuriya — Sat, 17 Jan 2026 02:29:21 +0000

This post shares 15 lessons learned while enabling observability and reliability using Datadog across 50+ large-scale AWS hosted applications. Post covers what worked, what mattered, and what actually improved customer experience.
For a quick background: over the last few years, I have been involved in setting up observability where almost every app was hosted in AWS. These included frontend-facing apps, middleware, backend apps, web and mobile, all of which were distributed with complex dependencies. Most of the apps were direct customer-facing, while others supported critical internal operations. These apps were mainly in the Telco, Media, and Banking & Finance business domains. Now let me get into our topic right away. While following is a nice list, some of these lessons I learned the hard way.

Lesson 1 - Datadog goes beyond observability, it’s a reliability tool

While I’m calling myself an Observability practitioner, I’m very much an SRE. My end goal is to enable world-class customer experience for end users. In order to do that, I rely heavily on Site Reliability Engineering (SRE) concepts. In the world of SRE, there are a few pillars we are focusing on:

Architecture– Reliability comes from strong architectures and design patterns
Observability – Full-stack visibility across systems
SLI/SLO & Error Budgets – Measuring customer experience
Release & Incident Engineering – Treating operations as a software problem
Automation – Eliminate, reduce, simplify, and automate
Resilience Engineering – Chaos engineering and failure testing
People & Awareness – The human factor in reliability

What this means is Observability is a key pillar of the grand scheme of reliability engineering. We enable Observability so we can measure customer experience. If we can measure customer experience, more often than not when it gets falling down, we’ll know how to isolate the root cause quickly and resolve it quickly. Of course, eliminate it promptly if possible. Datadog supports you in all the above pillars. That is why I’m calling it more of a reliability-enhancing tool instead of just an observability tool.

Lesson 2 – Datadog is your partner: Observability is a journey

Generally, we start with keeping the lights on, making systems observable, then making things correlate, and enabling AIOps. It’s a journey. I have publish a complete guide to the AWS observability maturity model V2. Datadog is well equipped to enable this journey for you. It has the capabilities.
Generally, people like to start from Infrastructure visibility; these days we are heavily into AWS Lambda, AWS ECS, or AWS EKS, or a combination of all of these. Datadog provides integrations to enable infrastructure visibility for you.
Once you have the Infrastructure visibility, you can use Datadog capabilities to enable Logs, Metrics, and Traces. This will ensure you have observability for your apps. Datadog Service Catalogues and Systems will allow you to bring it together so correlation is prompt. Datadog enables Metric Anomaly detection, Metric Forecasting, and Log anomalies to keep you one step ahead of the game. Use Watchdog, it will look at your entire service scope to identify anomalies for you. Datadog enables full-stack visibility across your entire AWS estate—from code and infrastructure to the business perspective.

Lesson 3 – Datadog SLOs – What drives it is the ability to measure customer experience

I like Observability as a byproduct of trying to achieve the ability to measure customer experience. I’m generally thinking about bringing in Service Level Indicators (SLIs) for any app, then converting them to build Service Level Objectives (SLOs). Once you enable Application Performance Monitoring (APM) with Datadog and you have logs, metrics, and tracers, it's about building an SLI dashboard—a single truth dashboard for your system. Then convert it to meaningful SLOs in Datadog. Datadog provides three types of SLOs:

By count – measure SLOs with good events divided by total events.
By monitor uptime – using a synthetic test to gauge uptime.
By time slices – using custom uptime definitions.

Our goal is to go through the Observability journey initially targeting having the ability to build Datadog SLOs. If you have SLOs, you already measure customer experience and you're way ahead of the game.

Lesson 4 – Datadog Real User Monitoring (RUM) – You need to know what the heck your end users are doing

Observability is great; it lets you have an understanding of your system's internal state. While it’s good, you need to know what your end users are doing. That’s why RUM comes into play. Not only it shows all metrics related to end-user experience, but capabilities such as Session Replay allow you to watch what customers are doing. When a customer complains something is not working, you're a few steps away from finding what that was using Datadog RUM.

Lesson 5 – Datadog loves when you enhance inbuilt telemetry with code changes

While we love Datadog because it enables most things without any code changes, when you do them, it greatly benefits. Like inserting encrypted important details with Sessions so when you're troubleshooting with Datadog RUM, you can filter with user details, product details, etc. Going slightly beyond has massive benefits. APM is the same as well. If there are deep corners where you're not getting that detail, try to do a little bit of code changes. You will see the magic.

Lesson 6 – There are all kinds of monitors provided by Datadog; use them wisely

In a high level, I see them as:

Infrastructure & Host Reliability: Metric, Host, Process Check, Live Process, Service Check, Change, Integration
Application Performance & Error Detection: APM, Error Tracking, Anomaly, Outlier, Forecast, Composite
User Experience & Frontend Reliability: Real User Monitoring, CI & Tests, Network Check
Logs, Events & Operational Intelligence: Logs, Event, Watchdog, LLM Observability
Network & Dependency Reliability: NDM NetFlow
Reliability Objectives & Governance: SLO
Observability Data Quality: Data Quality (preview)

Lesson 7: Datadog scorecards for observability governance

We would like to define Datadog systems, leverage Datadog service catalogues, and then enable Datadog scorecards. It’s a great automatic way to measure where you are. In-built capabilities are great and you can always expand with customizations using provided APIs.
Datadog scorecards cover:

Observability Best Practices: Ensures services emit the right signals by validating deployment tracking, log ingestion, and log–trace correlation so changes and runtime behavior are fully observable.
Ownership & Documentation: Confirms every service has clear ownership through defined teams, contacts, code repositories, and documentation to enable fast escalation and effective incident response.
Production Readiness: Verifies services are operationally ready for production by checking recent deployments, active monitors, on-call coverage, and defined SLOs.

Lesson 8: Build your incident management with Datadog On-Call & Incident Management

Datadog On-Call is your one-stop place for incident and escalation management. You can define teams, on-call details, and escalations. It will do on-call alerting and provide a lot of good metrics. Initially, when you start, you will see a lot of noise, but over a period of time, you can cut it down to a bare minimum. If you are in Datadog, there is no other on-call management tool you need. Datadog Incident Management allows you to create incidents and track them for closure. You can measure Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR) easily with on-call and incident management.

Lesson 9: Datadog synthetic tests to proactively test your AWS infrastructure

You get telemetry only when your end users are using the system. Synthetic tests enable us to test our application by mimicking end users. It’s not just a URL test; you can use Datadog capabilities to automate your smoke tests easily. Datadog provides great locations; you can initiate your tests across the world too.

Lesson 10: Datadog CI visibility and software changes – Keep track of what the developers are doing

Integrating your pipeline will let Datadog know what teams are deploying to production. By enabling deployment version tracking at Datadog APM, you can compare releases and response times using different releases. Make actions using those insights proactively.

Lesson 11 – Datadog workflow automations – Great way to automate remediate solutions

Datadog workflow automations is a solid place where you can build complex remediation solutions. It will allow you to automate tedious tasks and let monitors kick off some of these. It’s the first step to automate your job away. Datadog Workflow Automation has integrations with almost all AWS services. It’s a great way to automate AWS infrastructure and other operational workflows.

Lesson 12 – Datadog code security – yes, it’s a capability you can use to make your AWS based systems secured

Libraries (SCA), Static Code (SAST), Runtime code (IAST), Secret scanning (Secrets), and IAC, Datadog code security has really good capabilities to keep you secure. All you have to do is integrate your code base with Datadog code security. That is the first step to get the help you need from Datadog.

Lesson 13 – Datadog AI Observability – You will use this heavily in future

Every system is now getting integrated with LLMs. So you need a way to measure those AI performances. It’s a great capability to get full-stack AI observability into your systems now.

Lesson 14: Datadog Bits AI – SRE Agent, your new on-call team mate

Datadog has released Bits AI SRE Agent and it’s awesome. If you're still reading this, it's now available, and it has some great capabilities. It will accelerate root cause analysis to a few short minutes. It makes sense; when Datadog has access to your entire telemetry data, internal system state, what your end users are doing, and how your code is working, it’s able to use those data to identify root causes much faster. What I have seen is it’s having a great capability to correlate things much faster.

Lesson 15 – Datadog UI – it’s the best UI in the town – it provides business visibility to everyone

Datadog UI is great. It’s simple, it's easy, it simplifies complexity in a really cool way. It lets all your stakeholders from SREs, to developers, to senior business executives, or even CTOs to use it easily. There is a persona to be built for everyone. This is a game changer since you can now open business visibility to everyone in your organization.

These are some of the great lessons I have learned. There are many more, but I think it’s time to stop the list. Datadog is a great observability partner for AWS with built integrations. Give it a try with a 14-day Datadog free trial. Yes, it’s expensive, but I have seen it’s worth every penny. If your goal is not just visibility, but reliability at scale in AWS, Datadog provides the tooling—and more importantly, the operational leverage—to get there

AWS DevOps Agent: 10 best practices to get the most out of It

Indika_Wimalasuriya — Mon, 29 Dec 2025 17:27:33 +0000

One of the key releases that happened as part of AWS re:Invent 2025 was the launch of new frontier autonomous agents by AWS:

AWS DevOps Agent
AWS Security Agent
Kiro Autonomous Agent

Out of these, the AWS DevOps Agent is going to revolutionize the way DevOps and SRE teams work. In this guide, I'm going to cover the key best practices you need to consider to get the most out of your AWS DevOps Agent.

1. AWS DevOps Agent is not a tool; it’s a capability.

You read that correctly. The DevOps Agent is not a magic bullet that will solve all your problems while you sip your cup of tea. It's a capability, and the results will, of course, depend on how you use it.

Example: You can't just install an AIOps agent and expect the MTTR (Mean Time to Repair) to decrease. Alerts will still fire the same way, runbooks won’t be executable, and there will be no service ownership or defined SLOs (Service Level Objectives).

To get the most out of the DevOps agent, you need to define SLOs for each service, convert runbooks into executable processes, provide observability, ensure change visibility, and enable other capabilities so the agent can correlate deployments, suggest resolutions, and execute with humans in the loop.

Remember, capabilities involve people, processes, and tools, not just software.

2. Observability is the key; the agent needs context.

The importance of observability is as crucial as ever. If you thought you could park the observability discussion, you’re in for a rude shock. The agent needs context to act, and context comes from your telemetry data (metrics, logs, and traces).

It’s best to aggregate all your telemetry sources. If CloudWatch isn’t your cup of tea, integrations are available for all top observability tools, such as Datadog, Dynatrace, New Relic, and Splunk.

The idea is to ensure the agent can see the blast radius of an incident with the help of telemetry data so it can understand your system’s internal state and act on any changes with the correct intent.

Example: A load balancer encounters 5xx errors. Without observability context, the DevOps agent only sees the 5xx error count and would likely suggest scaling the load balancer or services. However, with full telemetry, the agent can identify that traces are slow due to SQL queries, and logs show that the RDS connection pool is exhausted with high CPU saturation in the database.

Now, the DevOps agent can conclude that the root cause is the RDS issue, which is causing the upstream 5xx errors. Scaling the ALB (Application Load Balancer) won’t resolve the problem.

We need to enable agents to understand the blast radius, not just the symptoms. Observability is key to providing this context.

3. Define golden signals (latency, error rate, saturation and traffic) so agent is able to work with symptoms

Agents reason better on symptoms or effoects instead of alerts which usually generate lot of noise. The more symptoms agent is having access, the better your agent is able to act on them.

Example: Instead of defining alerts based on infrastructure metrics like CPU > 80% or memory > 75%, you define thresholds such as checkout latency P95 > 2s or error rate > 1%. Alerts are then triggered due to increased latency or rising error rates.

In this case, the agent is able to reason about user experience, even when infrastructure metrics are not in an alarming state. This leads to better detection of end-user–impacting issues and more effective root cause analysis.

4. Agents need guidance; instead of wikis, provide agents with tools.

It’s common to provide runbooks that offer investigation guidance to agents. But remember, unless you provide real capabilities to your agent, a runbook is just documentation. While documentation is useful, you should aim to provide the agent with actionable solutions.

For example, provide Lambda functions that can pull telemetry data or execute remediation actions. Step Functions or other automated workflows that are part of the runbook can be made easily executable by the agent.

Just remember: your newest team member can’t repost failed orders simply by reading how to do it. But if there’s a Lambda function available, they may be able to use it. For instance, one Lambda function can identify the root cause, a second can determine the correct recovery or reporting function, and a third can execute the appropriate Lambda function.

Guidance must be clearly defined, with preconditions, safe actions, and always with rollback steps. This approach enables your agent to evolve from a recommendation-only agent into one that can actively remediate issues.

Example: Documentation may state that if the SQS backlog increases, you should check consumer health and restart pods. However, the agent cannot perform these actions on its own. Instead, you need to provide Lambda functions that can fetch queue depth and consumer lag, another Lambda function that can analyze failure patterns, and another function that can safely restart the consumer deployment. A Step Functions workflow can be used to orchestrate all these steps, including rollback.

During an incident, the agent can invoke these Lambda functions, identify stalled consumers, recommend execution of the Step Functions workflow, and carry it out after approval. In this scenario, the agent acts as an active operator, not just a passive observer.

5. An agent is like a human, focus on guardrails instead of denying permissions.

Just giving your agent administrative access is as bad as denying it the permissions it actually needs. An agent requires a reasonable level of access to do its magic.

While least-privilege IAM roles are important, it’s often more effective to focus on guardrails—clearly defining what the agent can and cannot do. For example, you might allow broad access for diagnostics while tightly controlling or restricting remediation actions.

With agents, you need to start becoming comfortable with autonomy that operates within well-defined rails, rather than blocking everything by default. This balance enables the agent to be effective while still keeping your environment safe.

Example of a bad approach: Giving an agent admin access is risky—one bad prompt could cause a production outage. On the other hand, if the agent only has read access to metrics, it delivers zero remediation value.

A better approach is to provide read-only access across all services, while allowing remediation only through approved capabilities such as Lambda functions or Step Functions. Direct delete, terminate, or drop permissions should never be allowed.

This model enables the agent to diagnose issues freely, while remediation can occur only through safe, audited paths with built-in guardrails. Autonomy within guardrails is the way forward.

6. Have a KT plan for the agent. Your team member needs some babysitting.

Treat the DevOps agent as your new team member. It may be a superhero when it comes to AWS, but it’s still a novice when it comes to your specific cloud implementation.

You need to train the agent with detailed information so it can develop a full understanding of your architecture, implementations, and even business context. Treat it like an expert Solution Architect who has just joined the team—don’t assume prior knowledge. Share everything you have and onboard it properly, rather than letting it jump straight into firefighting.

Example: When you onboard a new solution architect, you provide proper knowledge transfer (KT), share architecture diagrams, explain why things exist, outline the business rationale, and discuss past failures. You need to do the same with your DevOps agent.

Provide architecture diagrams, documentation, service mappings, business context, and known failure patterns. This enables the agent to prioritize a payment API over reporting jobs when managing alerts and to avoid repeating known bad remediations.

Always remember: context reduces incorrect automation actions.

7. Let agents know what your developers are doing.

Yes, it’s a DevOps agent—but it still needs visibility into what your developers are working on. It’s essential to connect your CI/CD pipelines and provide this visibility to the agent.

This allows the agent to correlate operational issues with recent code changes and deployments. As a result, the agent can identify specific commits or pipeline executions and isolate them to better understand the root cause of issues.

Let’s be frank: most incidents today are code-related or deployment-related. The old saying still holds true—if you don’t touch it, it won’t break on its own. So when something isn’t working, let your agent answer the critical question: what changed?

This significantly accelerates the agent’s ability to isolate root causes and reduce mean time to resolution (MTTR).

Example: Let’s say there is a latency spike at a certain time. The DevOps agent checks the CI/CD pipeline and identifies that a deployment occurred shortly before the spike. The commit included changes to payment-related files.

The agent then pulls additional metrics and correlates them with high confidence, concluding that the alert is caused by the recent deployment and recommending a rollback. Without this CI/CD context, the agent would waste time investigating infrastructure issues, increasing MTTR.

8.Hold your agent’s hand until it grows up. Start with a human in the loop and actively steer the work.

Of course, initially you need to be heavily involved—you can’t realistically expect a fully autonomous agent from day one. You need to observe its behavior, explain context, and provide detailed recommendations that the agent can act on.

Any remediation action should go through an approval process at the beginning. This is how the journey starts. Over time, you can gradually increase autonomy by putting the right guardrails in place. Remember, even a DevOps agent has to earn your trust.

Steering the agent is equally important. It’s your environment, so you need to stay actively involved. Use chat features to provide details, discuss failure scenarios, and plan responses in real time. If you notice false alarms or incorrect root-cause analysis, correct the agent. Explain why you disagree so it can learn effectively.

The idea is not to wait and see until the agent fails, but to proactively take action to ensure the agent succeeds.

Example: The agent recommends restarting RDS, but a human rejects the action and explains that an RDS restart could cause data loss or customer impact during peak hours. The agent learns about time windows, business constraints, and safer alternatives.

In later phases, the agent can automatically restart stateless services, while still requiring approval for any data-layer changes. Trust is built through guided autonomy.

9. Measure agent performance using business metrics.

An agent is not a shiny object that you deploy and forget about. It’s actually useless if it doesn’t positively improve outcomes. That means you need to start measuring the right metrics.

Track indicators such as Mean Time to Resolve (MTTR), noise reduction, the percentage of root causes identified automatically, and the percentage of remediations executed by the agent. These metrics help you understand whether the agent is delivering real value.

Unless you measure performance and take the necessary actions based on those insights, there will be no meaningful improvement.

Example: Before introducing the agent, MTTR was 45 minutes and 120 alerts were generated per service-impacting incident. After configuring the DevOps agent, MTTR dropped to 18 minutes, alert noise reduced to 35 alerts per incident, 40% of incidents were auto-diagnosed, and 20% were auto-remediated.

These are the real business benefits you should strive to achieve. If you can’t demonstrate measurable impact, the agent is just a shiny demo.

10. Actively look into agent investigation gaps and work to resolve them.

A DevOps agent will not be right on the first attempt, especially in the early stages. There will be many investigations it cannot continue due to implementation gaps, missing context, lack of telemetry data, missing capabilities, or permission issues.

You need to regularly review these investigation gaps and provide the necessary inputs to the agent. Over time, this will enable the agent to become more effective and smarter in the long run.

Example: The agent stops investigating and reports that it is unable to determine the root cause due to missing database query metrics. In response, you enable RDS Performance Insights, add slow query logs, and create a Lambda function to fetch query statistics.

With this additional context, the agent is able to identify long-running queries and suggest actions such as index creation or query throttling.

Every failure is a training data point for your agent, not a reason to abandon it or point fingers when it falls short.

Finally, you need to continuously evolve with the AWS DevOps agent and take it on the journey.

If you’re new to AWS DevOps and want to learn step by step, I’m creating a video series that does exactly that. You can check it out here:

Location-Based Flood Predictions with AI on AWS: Kalani River Case Study

Indika_Wimalasuriya — Tue, 16 Dec 2025 08:23:23 +0000

To provide some background, I live in Sri Lanka, and just a few weeks ago we experienced one of the most severe flood events in decades.
I live in Kaduwela, a town close to Colombo, where we face a risk of flooding even when there is little or no rainfall in our immediate area. This is mainly because one of Sri Lanka’s major rivers, the Kelani River, flows very close to us. When there is heavy rainfall in the upstream areas of the Kelani River, it naturally creates vulnerability for anyone living along the riverbanks downstream.

Between 27th and 29th November, Cyclone Ditwah approached Sri Lanka and made landfall on 28th November 2025, unleashing extremely heavy rainfall. This caused rivers to overflow and resulted in widespread flooding across the country. Most upstream areas of the Kelani River received over 200 mm of rain within a 24-hour period, putting immense stress on the entire upstream reservoir and river system. As a result, water levels exceeded capacity and surged rapidly into downstream areas, triggering landslides and floods across the island.

Living in Colombo, we began to feel the pressure around 29th November. Although there was already some level of flooding caused directly by Cyclone Ditwah, the situation worsened by the hour due to continuous heavy rainfall in the upstream regions of the Kelani River. It had already been predicted that Colombo could face its worst flooding in decades. Significant floods were previously reported in 1989 and 2016, and forecasts suggested that this event could be even more severe.
We were fortunate to live in an area that was not directly flooded during either the 1989 or 2016 floods. However, our area is still highly vulnerable—roads get flooded quickly, exit routes become blocked, and travel can be completely cut off even without water entering homes.

There were numerous weather forecasts being released, and naturally, during emergencies like this, we tend to glue ourselves in front of the TV, constantly watching 24-hour news coverage. At that point, I began to wonder whether there was a more scientific and data-driven way to assess the actual flood risk.

The key questions I was trying to answer were:

• When will my area experience its worst flooding?
• When will the flood risk subside?
• Will floodwaters reach my home?

I believe these are the three fundamental questions anyone living in a flood-risk zone tries to figure out during such events.

In our case, the situation was slightly easier to analyze because there was very little rainfall in our immediate area. The primary risk was coming from upstream flooding. This narrowed the problem down to understanding when large volumes of water would travel from upstream to downstream through the Kelani River, and how that surge would impact our location.

To identify a solution, the next step was to better understand the Kelani River itself. The image below provides a high-level view of the problem at hand, illustrating the course of the Kelani River.

Water flows through many areas, but the critical downstream path is:
Upstream → Kitulgala → Glencourse → Hanwella → Kaduwela → Colombo

All rainfall received in the upstream regions of the Kelani River eventually flows downstream through this path. Therefore, any significant increase in upstream rainfall directly impacts water levels in Hanwella, Kaduwela, and Colombo, making these areas particularly vulnerable during extreme weather events.

Next finding the data related to these key areas, there was meter reading happening regularly and data was available at https://www.dmc.gov.lk/.

Date & Time (25/11/2025) Nagalagam Street (m) Hanwella (m) Glencourse (m) Kithulgala (m)
18:30 2.20 2.38 10.30 1.78
21:30 1.60 2.26 10.21 1.89

Nagalagam Street is the river gauging station located in Colombo

The way flooding unfolds along the Kelani River is relatively predictable:

Upstream reservoirs and river sections begin to fill first
Water then flows downstream through Kitulgala
By monitoring water levels from Kitulgala to Nagalagam Street, we can effectively observe the entire flood progression
When water levels peak at Kitulgala, they subsequently recede there and then peak downstream at Glencourse, followed by Hanwella and finally Nagalagam Street This understanding was sufficient for me to build a quick GenAI application to estimate when flooding might impact Colombo and Kaduwela. I used AWS PartyRock to build this application.

I designed a prompt for the app to analyze water levels along the Kelani River and estimate flood risk for Colombo and surrounding areas. Here’s a structured breakdown of each component:

1.Extract Latest Readings
Instruction:
Extract the most recent water level readings for each station: Kithulgala, Glencourse, Hanwella, Nagalagam Street.
Purpose:
• Capture the current state of the river at multiple points.
• Provides the starting point for all subsequent calculations and risk estimates.
• Ensures analysis is based on real-time conditions, not historical averages.

2.Compare Against 2016 Peaks
Instruction:
Compare current water levels with 2016 flood peaks:
• Nagalagam Street: 7.65 m
• Hanwella: 10.51 m
• Glencourse: 19.80 m
• Kithulgala: N/A
Purpose:
• Provides a reference baseline for flood severity.
• Highlights areas exceeding historic flood levels, which helps prioritize alerts and resources.

3.Calculate Flood Height
Instruction:
Flood Height = Current Level − 2016 Peak
Purpose:
• Quantifies how much higher or lower current water levels are compared to the worst-known historical event.
• Critical for understanding the magnitude of risk at each station.

4.Estimate Kaduwela Level Using Proxy
Instruction:
Kaduwela Current ≈ Hanwella Current − 0.6–0.8 m
Kaduwela 2016 Peak = 10.51 m (Hanwella proxy)
Purpose:
• Kaduwela doesn’t have direct measurements in real time.
• Using hydrological proxies allows estimation of water levels based on upstream measurements.
• Provides continuous flood-risk monitoring for a critical transition zone between middle-basin and downstream areas.

5.Hydrological Principle: Transition Zone Dynamics
Instruction:
• Kaduwela sits between Hanwella and Nagalagam Street.
• Water reaches Kaduwela earlier than Nagalagam Street and recedes earlier.
• Estimate timing differences:
o Kaduwela resolves 6–12 hours before Nagalagam Street
o If Nagalagam = 24–36 hours, Kaduwela = 12–18 hours
Purpose:
• Explains flood propagation along the river.
• Ensures the model predicts not only peak levels but also timing.
• Helps residents and authorities prepare for upstream vs downstream risk.

6.Account for Hydrological and Geographical Factors
Instruction (improved):
Account for inter-station distances, river gradient, catchment size, flow velocity, elevation changes, and downstream lag time to produce more accurate flood-level estimates and timing predictions.
Purpose:
• Adds real-world context to the calculations.
• Recognizes that water doesn’t flow instantaneously: topography, distance, and river dynamics affect flood timing and severity.
• Improves accuracy of flood predictions across multiple stations.

7.Generate Markdown Table
Instruction:
• Include all five locations in downstream order:

Kithulgala (upstream)
Glencourse (upper middle)
Hanwella (middle)
Kaduwela (transition zone)
Nagalagam Street / Colombo (downstream) • Columns include: o Location o 2016 Max Level o Current Level o Flood Height vs 2016 o Status Now (🟢/🟡/🔴) o Trend (🟢/🟡/🔴) o Peak Status (🟢/🟡/🔴) o New Flood Risk o Notes Purpose: • Provides a clear visual summary for decision-makers. • Uses color-coded indicators for immediate understanding of risk and trend. • Ensures consistency in reporting across all stations.

8.Status Indicators Explained
Instruction:
• Status Now: 🟢 Low | 🟡 Moderate | 🔴 High/Critical
• Trend: 🟢 Falling fast | 🟡 Slowly falling | 🔴 Rising
• Peak Status: 🟢 Passed | 🟡 At peak | 🔴 Still coming
• New Flood Risk: Describes residual risks (secondary hazards, recurrence, duration)
Purpose:
• Translates numeric data into human-readable risk levels.
• Helps residents and authorities quickly identify which areas need attention now.
• Incorporates residual risk after peak passes.

9.Example Rows
• Glencourse: Peak passed, still elevated, water moving downstream
• Kaduwela: At or near peak, transition zone, clears before Colombo
• Nagalagam Street: Still critical, longest drainage time
Purpose:
• Shows how to interpret table data for decision-making.
• Demonstrates flow progression and lag effects along the river.

10.Summary Paragraph
Instruction:
• Explain upstream recession, Hanwella’s peak, Kaduwela as transition zone, Nagalagam Street as longest residual risk
• Highlight timing cascade, residual hazards, recurrence vulnerability, infrastructure exposure, contaminated waters
• Provide safety recommendations prioritizing areas with longest drainage times
Purpose:
• Converts table data into a narrative that is actionable.
• Provides context for emergency response and public awareness.
• Completes the flood analysis workflow from data → calculation → visualization → actionable insights.

Output

Details About the AI Model

I used Claude 3.5 Sonnet V2 for this project because of its strong reasoning capabilities and structured output formatting, which made it well-suited for analyzing hydrological data and generating clear, actionable tables.
I deliberately disabled internet access for the model, as I was already supplying all the relevant, real-time water-level data. This ensured that the analysis relied solely on the data I provided, avoiding inconsistencies or external noise.
I set the temperature to 0 to encourage focused, deterministic responses. This reduced variability and ensured that the output was predictable, consistent, and easy to interpret, which is critical when analyzing flood risk and producing actionable tables.

This approach allowed me to gain a clear understanding of peak water levels at each location, including when the peak would occur at Kaduwela and when the flood risk would subside. It provided a sense of control and reassurance during an otherwise uncertain situation.
The next step was to estimate the actual flood risk for my location. While precise predictions are inherently difficult, I developed a practical workaround. Since the floodwaters were nearby, I could pinpoint exact locations of two key points, their water levels, and combine that with the precise location and elevation of my house. Using this data, I had the model generate flood projections for my property. It’s not a perfect solution, but it was a feasible and useful approach given the circumstances.

Pros

Personalized Risk Assessment: Provides flood projections specific to your house/location, rather than general area-wide warnings.
Early Awareness: Helps anticipate peak water levels and timing, giving time to prepare and take preventive measures.
Data-Driven Comfort: Using actual upstream measurements combined with your location offers a sense of control and situational awareness during uncertain flood events.

Cons

Limited Accuracy: The approach depends on proxy data and approximations, so predictions may not perfectly reflect real conditions.
Point-Specific: Works well for specific locations, but cannot provide a comprehensive view for wider areas or multiple properties.
Model Limitations: The AI model may miss sudden changes in rainfall or upstream surges, as it relies only on the provided data.

How I Used AI to Assess Flood Risk at My Location

To understand the flood risk for my house during the recent Kelani River floods, I created a location-based AI analysis. I approached it as if I were a hydrological flood risk analyst, providing the AI with reference points along the river and my home’s location. Here’s how the process worked:

Reference Points and User Location

I provided the AI with two reference points along the river and my house location:

Reference Point 1

Coordinates: [Reference Point 1 - Coordinates]

Current Flood Level: [Reference Point 1 - Flood Level]

Elevation: [Reference Point 1 - Elevation]

Reference Point 2

Coordinates: [Reference Point 2 - Coordinates]

Current Flood Level: [Reference Point 2 - Flood Level]

Elevation: [Reference Point 2 - Elevation]

User’s Location (My House)

Coordinates: [Your Location - Coordinates]

Elevation: [Your Location - Elevation]

Step 1: Location Proximity Analysis

The AI calculated which reference point is closer to my house and estimated distances in meters. This helps understand which upstream or downstream readings are more relevant for my flood risk.

Step 2: Elevation and Topography

Using elevation data, the AI analyzed how terrain affects flood propagation.

Higher elevations naturally have lower flood risk.

Lower elevations or downhill positions are more vulnerable.

Step 3: Flood Level Interpolation

The AI estimated my house’s likely flood level based on:

Linear interpolation between the two reference points

Elevation differences (water flows downhill)

Upstream vs downstream position in the river basin

Whether my location is uphill or downhill from the reference points

This gave a custom flood level prediction for my specific location.

Step 4: Risk Assessment

Using color-coded indicators, the AI assessed the flood risk:

🟢 GREEN – Low risk, safe, flood levels below dangerous thresholds

🟡 AMBER – Moderate risk, approaching warning levels, caution advised

🔴 RED – High risk, at or above thresholds, immediate concern

Step 5: Estimated Flood Level

The AI provided an interpolated flood level estimate for my house based on the reference points, factoring in both water level and elevation.

Step 6: Comparison with Reference Points

It explained how my location compares to the reference points in terms of elevation and expected flooding, helping me understand whether I was upstream, downstream, or in a critical transition zone.

Step 7: Water Depth Calculation

By subtracting my house’s elevation from the projected flood level, the AI calculated the expected water depth at my location.

Step 8: Recommendations

The AI provided actionable advice based on the predicted flood risk:

Evacuation timing

Preparing flood barriers

Monitoring upstream changes

Step 9: Timing Estimates

Finally, the AI estimated when the flood risk would peak at my house and when it would recede, based on trends observed between the two reference points.

Why This Approach Works

This location-based analysis allows homeowners to:

Understand their specific flood risk, not just general area warnings

See expected water levels and timing

Make informed decisions about safety and preparation

By combining reference points, elevation data, and interpolation, this method provides a practical, data-driven solution for assessing flood risk at any individual location along a river.

Test the App - https://partyrock.aws/u/sre/qfVrummDF/Location-Based-Flood-Predictions

Of course, this approach is not perfect, but it gave me something constructive to focus on during a very stressful period. It allowed me to feel like I had some control over what was happening—at least, that’s how I like to think about it.
Next Steps

This solution could be improved into a full system where anyone can provide their Google Maps location, and the system predicts their flood risk automatically.
We could incorporate additional data points and more sophisticated scientific formulas to make the predictions more robust and accurate.
The AI model could be integrated with advanced forecasting capabilities, including rainfall projections and upstream river data, for real-time monitoring.
If anyone is interested in taking this project to the next level, feel free to send me a message on LinkedIn.

AWS Outage Exposed Your SaaS Stack — Here’s How to Make It Resilient

Indika_Wimalasuriya — Tue, 21 Oct 2025 12:03:53 +0000

It is now well documented that the us-east-1 region experienced a significant outage on AWS on October 20th, 2025. While there is already much discussion around why such a vast number of systems were impacted and what design weaknesses were exposed, for me, the real story isn’t just that AWS went down (of course, us-east-1), but rather how many SaaS providers went down with it.

There is a growing push toward adopting SaaS platforms due to their obvious advantages — abstracting away infrastructure management and letting teams focus on solving business problems that matter. However, while SaaS is beneficial, it hides many resiliency weaknesses — until you get the shock of your life during a major cloud outage.

A Closer Look: Example E-commerce Architecture Affected

Let’s take one example: if you're running a large e-commerce platform, your architecture might rely on the following stack — and here's how each layer was affected.

Note: The SaaS dependency and impact details in this article are based on publicly available information, incident reports, and observed behavior during the AWS us-east-1 outage. Some examples are illustrative or inferential in nature and may not reflect the full internal architecture of each provider.

Frontend Hosting — Vercel

Vercel, a popular platform for Next.js applications, was reportedly impacted during the outage, likely due to its reliance on AWS infrastructure such as Lambda (for serverless functions), EC2 (for compute), and DynamoDB (for metadata storage).
During the outage, users experienced:

Failed deployments
Elevated error rates in serverless functions
CDN rerouting issues
Intermittent dashboard access

While Vercel's architecture spans multiple regions, users whose deployments were primarily in US-East-1 faced notable downtime, with some sites and APIs going offline temporarily.

Vercel CEO Guillermo Rauch acknowledged the issue on X:

Identity Management — Auth0

Auth0, an Okta company, is widely assumed to rely heavily on AWS infrastructure, which may have contributed to service disruptions during the US-East-1 outage. For customers in that region, failover mechanisms such as Geo-HA may have been triggered, though public information on their effectiveness is limited.

Observability — Datadog

Datadog was likely affected to some extent during the AWS US-East-1 outage, given its integration with AWS services such as DynamoDB, EC2, and Lambda for telemetry ingestion (metrics, logs, traces).

Possible effects for users included:

Delayed data processing
Gaps in historical logs
Reduced visibility into workloads running on AWS

Datadog operates on a multi-cloud architecture (AWS, GCP, Azure), so the platform did not experience complete downtime. Nevertheless, users relying on AWS-specific integrations may have seen temporary cascading issues.

Payments — Stripe

Stripe may have experienced some service disruptions during the AWS US-East-1 outage. Much of Stripe’s infrastructure runs on AWS (EC2 for compute, S3 for storage), which could have contributed to temporary issues.

Possible effects reported by users included:

Elevated API error rates
Dashboard access issues
Payment processing delays

While Stripe did not experience a full outage, dependencies on AWS services may have led to cascading issues affecting certain workflows.

Communication & Collaboration — Slack

Slack reportedly experienced some service disruptions during the AWS US-East-1 outage, possibly due to dependencies on AWS services such as EC2, S3, and Lambda.

Users may have noticed:

Failed message deliveries
Delayed notifications
Intermittent workspace loading

These are just a few examples. The list goes on — and it reveals a critical point: SaaS platforms promise scalability, ease of use, and low maintenance, but their black-box nature hides several resiliency vulnerabilities — which the AWS outage brought into the spotlight.

What Went Wrong: Key Issues Exposed

Cascading Failures from Shared Infrastructure

Many SaaS providers run on AWS and default to the US-East-1 region due to its maturity and low latency.
But when it fails, it creates ripples across countless services, often in unexpected ways.

SaaS ≠ Always-On

Without transparency into a provider’s infrastructure, you can’t audit failover paths or validate high availability claims.
This creates a domino effect, where one outage stalls your entire workflow, and you're left completely blind.

Even Giants Weren’t Immune

Some large streaming platforms may experience disruptions during regional cloud outages. For example, high-traffic services like Disney+ Hotstar could be affected by dependencies on cloud infrastructure such as AWS EC2 or S3, though no confirmed reports are available for this specific outage.

The reality is that most of the issues discussed above are beyond our direct control. SaaS providers abstract away their backend infrastructure, which can leave you vulnerable to upstream failures. However, there are proactive steps you can take within your control to mitigate these risks and improve system resilience.

SaaS Resilience Improvement Plan

1. Map and document SaaS dependencies
Create and maintain an up-to-date inventory of all SaaS services your system relies on, both directly and indirectly. Include details such as the underlying cloud infrastructure (e.g., AWS, GCP), regional hosting, and the criticality of each service to your operations.

2. Implement client-side circuit breakers and retries
Add fault-tolerance mechanisms in your frontend and backend code, such as circuit breakers, timeouts, exponential backoff retries, and fallback UIs. This ensures that transient SaaS outages do not fully break your user experience.

3. Cache critical data locally
For high-availability features (e.g., product catalog, user settings), implement edge or client-side caching strategies. This allows your system to serve stale-but-usable data if upstream SaaS services are temporarily unavailable.

4. Set up independent monitoring and alerting
Do not rely solely on the provider’s status pages. Implement external health checks and synthetic monitoring to independently track the availability and performance of critical third-party services.

5. Enable redundant SaaS providers (where feasible)
For high-risk areas such as authentication, payments, or observability, consider integrating with secondary SaaS providers that can be switched to manually or programmatically during outages. Be mindful that this can increase complexity and may require handling differences between providers.

6. Configure multi-region deployment for services under your control
Where you manage infrastructure or use PaaS providers (e.g., Vercel, Firebase), ensure that deployments span multiple regions. Avoid over-reliance on a single cloud region, such as AWS us-east-1.

7. Use event-driven buffering for critical workflows
Decouple workflows using queues or message buffers (e.g., SQS, Kafka, Durable Queues) so that temporary upstream failures do not result in data loss or dropped transactions.

8. Test system resilience with chaos engineering
Regularly simulate SaaS outages (e.g., temporarily disabling a key API) to test how your system behaves under failure conditions and identify points of fragility before a real outage occurs.

9. Establish offline-friendly workloads
Where possible, allow users to continue working in a limited or offline mode—especially in mobile apps or agent consoles—and sync data back once the upstream SaaS service recovers.

10. Monitor and enforce SaaS SLAs
Track uptime, latency, and incident response of critical SaaS providers. Ensure they meet their SLA commitments, and escalate contractually or operationally if violations become frequent.

These strategies will not eliminate risk entirely, and that’s okay. But they can significantly reduce exposure so that when the unexpected happens, you’re not scrambling—you’re calmly sipping a cup of tea.

I built an AI Agent That reveals Wall Street Sentiment in seconds

Indika_Wimalasuriya — Sun, 31 Aug 2025 21:08:32 +0000

This is a submission for the AI Agents Challenge powered by n8n and Bright Data

What I Built

I built an AI agent that aggregates and analyzes sentiment from:

All online sources
Major US stock market news sites
X.com user posts

It delivers a consolidated Wall Street sentiment report in seconds. The agent scans trending trader discussions, financial headlines, and online chatter, then generates a concise, actionable summary—sent directly via email—so investors can understand market sentiment instantly without spending hours researching.

Demo

n8n Workflow

GitHub - https://github.com/wimalasuriyaib/WallStreetSentimentAnalyzer

Wall Street Sentiment Analyzer - This is my main Agent workflow.

Online Stock Market Sentiment Workflow

X.com Stock Market Sentiment Workflow

Agent Capabilities

Final ouput - US Stock Market Sentiment Report

Watch the demo video for walk through of the Agent

Technical Implementation

Overview

The workflow is designed to automatically fetch real-time US stock market sentiment, process the data, and generate a concise word summary suitable for a blog post. It leverages BrightData for data extraction, Google AI for querying, and Google Gemini (PaLM API) for text summarization. The workflow is fully automated and orchestrated using n8n, an open-source workflow automation tool.

System Instructions

Triggering: Manual execution via the Manual Trigger node or can be scheduled with a Cron node.

Data Collection: BrightData Web Scraper Node: Sends a query to the BrightData dataset API to extract market sentiment from a specified URL using a predefined prompt

Snapshot Monitoring: The workflow waits for the BrightData snapshot to be ready and monitors its progress using the Check Snapshot Status node.

Data Handling: Once the snapshot is ready, the Download Snapshot Content node retrieves the data.

Edit Fields Node: Normalizes the output JSON and extracts the relevant answer_text for further processing.

Data Summarization: Google Gemini Node: Passes the extracted text to the Gemini model (models/gemini-2.0-flash) to generate a concise word summary suitable for a blog post.

Prompts are dynamically injected from the snapshot content for contextual summarization.

Model Choice

Google Gemini (PaLM API) – Selected for its ability to generate human-like, high-quality text summaries and handle complex financial language and sentiment analysis effectively.

Memory / Data Handling

Workflow uses pinData to store intermediate data (answer_text) securely within n8n.

Each node is stateless, relying on BrightData snapshots to maintain consistency and reproducibility.

Workflow design ensures error handling via conditional checks (IF nodes) to retry waiting or snapshot download until data is ready.

Tools Used

n8n: Orchestrates the workflow, manages triggers, and passes data between nodes.

BrightData: Handles data extraction from dynamic websites using snapshots. Provides monitoring APIs to ensure completeness and accuracy.

Google Gemini (PaLM API): Processes raw sentiment text. Produces coherent, concise summaries ready for blog publishing.

Workflow Highlights

Asynchronous snapshot handling: Ensures the workflow doesn’t fail if data isn’t ready immediately.
Dynamic prompt injection: Allows custom queries without modifying the workflow logic.
Seamless integration: BrightData and Google Gemini nodes are fully credentialed and reusable for multiple datasets or sentiment sources.
Scalable design: Can be extended to multiple stock tickers, social media sentiment, or regional markets by adjusting the query parameters.

Future Enhancements

Integrate automated blog publishing via WordPress or Medium APIs.
Add historical sentiment tracking and trend analysis.
Incorporate alerts or notifications if sentiment changes drastically.

Bright Data Verified Node

The Bright Data Verified Node is a critical component in our stock market sentiment workflow, providing reliable and scalable web data extraction without the typical challenges of web scraping. By leveraging Bright Data, we can trigger dataset snapshots, monitor their progress in real-time, and download structured results automatically. This eliminates the need for building custom scraping pipelines, handling IP rotation, or managing proxy networks—tasks that are notoriously error-prone and time-consuming.

Using Bright Data ensures high data accuracy and compliance, which is particularly important when accessing dynamic and frequently updated sources like Google AI search results. Without it, we would face the complexity of dealing with anti-bot mechanisms, frequent source changes, and the overhead of continuously maintaining scraping scripts. Such manual approaches often result in inconsistent data, higher failure rates, and significant delays in processing, all of which could compromise the quality of downstream AI analysis.

By integrating the Verified Node, our solution gains reliability, speed, and maintainability. The node abstracts away the operational burdens of web data extraction, allowing us to focus on extracting insights, summarizing market sentiment with AI, and generating actionable content. Bright Data, therefore, transforms what could be a fragile, labor-intensive process into a seamless, scalable workflow.

Journey

Participating in this hackathon was an incredibly exciting experience, as it was my first time using both n8n and Bright Data. I began by spending several hours watching n8n videos. Since n8n is a low-code solution, my initial approach was to jump straight into building—but I quickly realized I lacked the basics and failed miserably after a few hours. While it is designed to accelerate development, mastering the fundamentals is essential.

I then went through the n8n Beginner Course on YouTube at high speed and complemented it by using freely available templates to build small projects for hands-on practice. I took a similar approach with Bright Data, experimenting with small projects to get comfortable with its capabilities.

Once I felt confident with both tools, I defined my problem statement: capture Wall Street sentiment analysis in seconds. Developing this stock market sentiment workflow was both challenging and rewarding. The initial goal was to capture real-time investor sentiment reliably and convert it into actionable AI-driven insights. A major hurdle was handling dynamic web content, especially from sources like Google AI Search, which frequently change and block automated requests. Without a robust solution, scraping would have been slow, error-prone, and difficult to maintain.

Integrating Bright Data’s Verified Node was a game-changer. It provided a secure, compliant, and scalable way to trigger dataset snapshots, monitor progress, and retrieve structured results effortlessly. This eliminated the need to manually manage proxies, IP rotations, and anti-bot measures.

Processing large amounts of unstructured text data was another challenge. Leveraging Google Gemini (PaLM) for summarization enabled us to convert raw responses into concise, high-quality 200-word blog posts. Combining Bright Data’s reliability with AI-powered summarization streamlined the workflow and significantly reduced operational complexity.

Since Bright Data can’t be directly added as an agent tool, I created two separate workflows: one to gather sentiment from online content and another from users on x.com. It took me some time to figure this out, but once implemented, completing the project became much faster.

The Hackception: Mini Hackathon Inside the Hackathon

A major highlight of this hackathon was involving students from the University of Peradeniya. During the Code to Cloud training program, I decided to run a mini hackathon within the main hackathon, introducing students to n8n and Bright Data. We launched it on Friday (just two days to go), conducted walkthroughs of sample projects, and then let students develop their own workflows. So far, one student has submitted their project, and I expect more submissions as the deadline approaches. To make it more exciting, we offered two free tickets to AWS Community Day Sri Lanka for students who delivered strong projects.

*Check out my quick overview video on n8n and BrightData on YouTube: *

This journey reinforced the value of automation, scalability, and robust integrations, allowing me to focus on insights rather than infrastructure. Working with n8n was empowering, enabling rapid development of agentic solutions, while Bright Data simplified web data collection immensely. Overall, I gained deep technical knowledge, built a functional stock sentiment workflow, and successfully ran a hackathon inside a hackathon, inspiring the next generation of tech enthusiasts. I couldn’t ask for a more fulfilling experience.

Amazon Cognito Observability Best Practices with Datadog

Indika_Wimalasuriya — Sun, 10 Aug 2025 12:01:52 +0000

Amazon Cognito is an user authentication and authorization service that lets you enable sign-up, sign-in, and access control for your web and mobile systems. Cognito handles user accounts, password recovery, multi-factor authentication, and more. It also allows integration with popular single sign-on (SSO) services such as Google, Facebook, and Apple. Finally, one of its most important features is the ability to scale to millions of users.

There Are Two Types of Cognito Pools

1. User Pools
User Pools handle user sign-up, sign-in, and authentication. They act as a user directory managed by AWS. User Pools provide features such as multi-factor authentication (MFA), password policies, and integration with identity providers like Google, Apple, SAML, and OIDC.

The output of a User Pool is a set of tokens for authenticated users: an ID token (JWT), an access token, and a refresh token.

2. Identity Pools (also known as Federated Identities)
Identity Pools provide temporary AWS credentials that allow authenticated users to access AWS resources directly. They can work with User Pools or other identity providers.

Identity Pools can federate identities from multiple sources into a single AWS identity. They use AWS Security Token Service (STS) to issue temporary AWS access keys based on assigned IAM roles.

In this blog post, I will walk through observability in Amazon Cognito User Pools.

First things first: Cognito observability mainly relies on two types of telemetry data — metrics and logs. Let’s go through them in detail.

Amazon Cognito Metrics

You can enable the Amazon Cognito – Datadog integration to collect and monitor these metrics.

This integration will enable below metrics:

1. Sign-In Metrics

Measure user authentication activity and throttling.

✅ Sign-in Success % → aws.cognito.sign_in_successes
📊 Sign-in Requests → aws.cognito.sign_in_successes.samplecount
🏆 Successful Sign-ins → aws.cognito.sign_in_successes.sum
🚫 Throttled Sign-ins → aws.cognito.sign_in_throttles

2. Sign-Up Metrics

Track new user registrations and throttling.

✅ Sign-up Success % → aws.cognito.sign_up_successes
📊 Sign-up Requests → aws.cognito.sign_up_successes.samplecount
🏆 Successful Sign-ups → aws.cognito.sign_up_successes.sum
🚫 Throttled Sign-ups → aws.cognito.sign_up_throttles

3. Token Refresh Metrics

Monitor token refresh performance and throttling.

✅ Token Refresh Success % → aws.cognito.token_refresh_successes
📊 Token Refresh Requests → aws.cognito.token_refresh_successes.samplecount
🏆 Successful Token Refreshes → aws.cognito.token_refresh_successes.sum
🚫 Throttled Token Refreshes → aws.cognito.token_refresh_throttles

4. Federation Metrics

Track identity federation success and throttling.

✅ Federation Success % → aws.cognito.federation_successes
📊 Federation Requests → aws.cognito.federation_successes.samplecount
🏆 Successful Federation Requests → aws.cognito.federation_successes.sum
🚫 Throttled Federation Requests → aws.cognito.federation_throttles

5. Risk & Security Metrics

Measure detected risks and blocked requests.

⚠️ Account Takeover Risk → aws.cognito.account_take_over_risk, aws.cognito.account_takeover_risk
🔐 Compromised Credential Risk → aws.cognito.compromised_credential_risk, aws.cognito.compromised_credentials_risk
🟢 No Risk Detected → aws.cognito.no_risk
🛑 Any Risk Detected → aws.cognito.risk
⛔ Blocked by Config → aws.cognito.override_block

Amazon Cognito Logs

Amazon Cognito supports two plans, including the Plus plan — an enhanced set of user pool features designed for applications that require advanced security options. The Plus plan enables logging and analysis of user activity. It allows you to access logs, risk ratings, and CloudWatch metrics related to user authentication activity within your user pool.

You need to configure your Datadog Lambda forwarder function in AWS and add Cognito logs as a trigger to send the logs to Datadog.

This will enable you to receive Cognito logs in Datadog.

Amazon Cognito Log Attributes are as follows :

Group	Attribute	Description
User & Identity Information	🔶 userName	The username involved in the event.
	userSub	Unique UUID assigned to the user in the User Pool.
	idpName	Identity Provider name (e.g., Google, Facebook, SAML).
	clientId	App client ID used for the request.
	userPoolId	Cognito User Pool identifier.
	id	Internal log event identifier.
Event Context	🔶 eventType	Type of event (e.g., SignUp, SignIn, PasswordChange).
	eventSource	Source of the event (e.g., USER_AUTH_EVENTS).
	🔶 eventResponse	Event status (e.g., Pass, Fail, InProgress).
	eventId	Unique ID for the event.
	eventTimestamp / timestamp	Event time in epoch milliseconds.
	creationDate	Date/time the event record was created.
	challenges	Authentication challenges and outcomes (e.g., Password:Success).
Risk & Security Signals	🔶 riskDecision	Risk analysis result (e.g., PASS, FAIL, BLOCK).
	compromisedCredentialDetected	Whether compromised credentials were detected (true/false).
	riskLevel	Level of risk detected (e.g., Low, Medium, High).
Client & Location Data	🔶 ipAddress	IP address of the client making the request.
	city	City from IP geolocation.
	country	Country from IP geolocation.
	deviceName	Browser and OS details (e.g., Chrome 138, Windows 10).
Logging & Invocation Data	logLevel	Log severity level (e.g., INFO, WARN, ERROR).
	host	Host name (e.g., cognito).
	service	AWS service producing the log (e.g., cloudwatch).
	version	Log event schema version.
	invoked_function_arn	ARN of the Lambda function processing/forwarding the log.
	logSourceId.userPoolId	User Pool ID from log source metadata.
	requestId	AWS request identifier for the service call.
Feedback Data (Optional)	eventFeedbackDate	Date feedback was recorded.
	eventFeedbackProvider	Entity providing the feedback.
	eventFeedbackValue	Feedback result (e.g., Valid, Invalid).
Miscellaneous	hasContextData	Boolean indicating additional context data availability.

Username, Event Type, Event Response, Risk Decision, and IP Address are the most commonly used attributes that you can use to create rich custom metrics to facilitate many of your fine-grained drill-down needs.

Finally, it's about creating a Service Level Indicator (SLI) dashboard that provides a business perspective on things.

Troubleshooting Cognito-Related Issues: Best Practices

Use Amazon Cognito Plus Tier
The Cognito Plus tier is highly recommended.
It enables log delivery and provides risk-based metrics such as riskDecision, eventType, and eventResponse, which are essential for troubleshooting authentication and security issues.
Build Custom Metrics Using Logs in Datadog
Cognito logs come with rich attributes (refer to the previous table), allowing you to create powerful custom metrics.
These metrics can offer deep visibility into user behavior, login patterns, error spikes, and other critical insights.
Set Up SLI and SLO Dashboards
It's important to translate technical metrics into your business context — in other words, what your end users are actually experiencing.
This allows you to build meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) that track reliability from a user-focused perspective.

That's a wrap on my AWS Cognito Observability Guide. Use these best practices to improve visibility, reduce troubleshooting time, and align system metrics with business goals.

Amazon API Gateway Observability Best Practices with Datadog

Indika_Wimalasuriya — Sun, 03 Aug 2025 04:37:54 +0000

AWS API Gateway is a fully managed service from AWS that allows you to create, publish, and maintain APIs at any scale. It acts as a gateway to your application's backend services, including AWS Lambda, EKS, ECS, EC2, and more.

You can explore the full documentation here:
🔗 API Gateway Developer Guide - You can refer all the details you wants related to API Gateway here

To make sure we’re aligned on the fundamentals, I’ve created an API Gateway Essentials summary below. It gives you a quick overview of the core capabilities this service offers.

The main objective of this blog is to walk through how to monitor and observe AWS API Gateway using Datadog — one of the leading observability platforms that provides full-stack visibility into AWS environments.

Before diving in, a quick refresher:
Observability is the practice of using telemetry data (logs, metrics, and traces) to understand a system’s internal state. In this case, we’ll leverage API Gateway’s logs, metrics, and traces to gain insights into what’s really happening under the hood.

API Gateway Logs
AWS provides built-in support for enabling logs. You can enable them under API Gateway → Stages, where logging options are available for both access logs and execution logs.

Once logging is enabled, you can configure API Gateway to send logs to Datadog.

Configuration guide: Datadog + API Gateway Integration

Why Logs Matter
Logs are essential for troubleshooting issues in API Gateway. In most cases, failures fall into one of two categories:

Backend-related issues
Unresponsive services (e.g., Lambda, EC2, EKS) or misconfigurations such as timeouts or incorrect integration responses.

AWS infrastructure-level issues (rare)
These could include internal AWS errors or regional service disruptions.

Common Causes of API Gateway Failures

Misconfigured integrations (e.g., VPC links, request/response mapping templates)
Backend timeouts
Incorrect or missing HTTP status code mappings

API Gateway Metrics

AWS provides a rich set of metrics for API Gateway that align with the three golden signals of observability: traffic, errors, and latency. These metrics are essential for monitoring the health, performance, and reliability of your APIs — helping you detect issues early and respond proactively.

API Gateway Metrics – Grouped Summary

Type	Metric	Description
Traffic	`aws.apigateway.count`	Total number of API requests received
	`aws.apigateway.count.p50` - `.p99`	Percentile distribution of request count
	`trace.aws.apigateway.hits`	Total hits from traces
	`trace.aws.apigateway.hits.by_http_status`	Hits grouped by HTTP status code
	`trace.aws.apigateway.stage.hits`	Hits per deployment stage
	`trace.aws.apigateway.stage.hits.by_http_status`	Stage-level hits by HTTP status
Errors	`aws.apigateway.4xxerror`	Client-side errors (e.g., invalid request, unauthorized)
	`aws.apigateway.4xxerror.p50` - `.p99`	Percentiles of 4xx error rates
	`aws.apigateway.5xxerror`	Server-side/API errors (e.g., backend failure)
	`aws.apigateway.5xxerror.p50` - `.p99`	Percentiles of 5xx error rates
Latency	`aws.apigateway.latency`	Total time from request to response (includes backend)
	`aws.apigateway.latency.p50` - `.p99`	Percentile breakdown of total latency
	`aws.apigateway.latency.minimum` / `.maximum`	Min and max observed latency values
Integration Latency	`aws.apigateway.integration_latency`	Time spent in the backend integration only
	`aws.apigateway.integration_latency.p50` - `.p99`	Percentile breakdown of backend latency
	`aws.apigateway.integration_latency.minimum` / `.maximum`	Min and max integration latency
Tracing / Duration	`trace.aws.apigateway.duration`	Trace-based total API duration
	`trace.aws.apigateway.duration.by_http_status`	Duration per status code
	`trace.aws.apigateway.stage.duration`	Duration per stage
	`trace.aws.apigateway.stage.duration.by_http_status`	Stage duration by status code
Tracing / Apdex	`trace.aws.apigateway.stage.apdex`	User satisfaction score (Apdex) per stage
Meta	`trace.aws.apigateway`	Base trace for API Gateway
	`trace.aws.apigateway.stage`	Trace identifier for specific stage

API Gateway Tracing

A best practice is to enable tracing for Application Performance Monitoring (APM) on your backend services—such as AWS Lambda or microservices running on ECS, EKS, or EC2. Enabling tracing automatically provides you with the API Gateway tracer view, giving detailed insights into the flow and performance of your APIs.

In the example below, I have enabled tracing for an AWS Lambda backend, which allows me to view the API Gateway trace data.

The example below shows a trace starting from API Gateway, capturing the end-to-end flow through the backend Lambda function and any other integrated services

Service Level Indicator (SLI) Dashboard for API Gateway

Finally, you need to bring everything together and create a single source of truth dashboard for API Gateway, which provides insights into traffic, errors, and latency. It should include request volume and trends to help identify potential issues promptly.

The dashboard should also highlight:

Failed traces

Traces taking more than x seconds — useful for identifying slow requests passing through API Gateway that require further investigation

Relevant logs for deeper analysis

A combination of all these elements will give you a comprehensive view of your API Gateway, enabling effective monitoring and faster troubleshooting of any potential failures or performance issues.

And that wraps up a complete guide to achieving observability for Amazon API Gateway using Datadog.

CloudFront Observability Best Practices with Datadog

Indika_Wimalasuriya — Wed, 02 Jul 2025 08:51:56 +0000

Amazon CloudFront is Amazon's own Content Delivery Network (CDN), designed to speed up content delivery to users by distributing it across a global network of edge locations. CloudFront caches content closer to users, thereby reducing latency.

You can explore the CloudFront full documentation here:

To make sure we’re aligned on the fundamentals, I’ve created an CloudFront Gateway summary below. It gives you a quick overview of the core capabilities this service offers.

When using Amazon CloudFront, it’s essential to enable complete visibility into what’s happening at that layer.

Leverage CloudFront Metrics for Performance and Latency Observability
Start with the default CloudFront metrics, which give valuable insights

Requests – Tracks the number of HTTP/HTTPS requests.

Total Error Rate – Monitors the overall error rate, including both 4xx and 5xx errors.

4xx and 5xx Error Rate – Separates client and server errors for more granular analysis.

Bytes Downloaded/Uploaded – Helps track data volume and monitor trends.

To get deeper visibility, enable additional CloudFront metrics

Cache Hit Rate – Shows the percentage of requests served from the cache.

Origin Latency – Measures how long CloudFront takes to start responding when content comes from the origin (not the cache).

Error Rate by Status Code – Breaks down errors further (e.g., 401, 403, 502) for precise troubleshooting.

These metrics give you a clear view of what’s happening inside your CloudFront distribution.

To enable CloudFront metrics, first complete the Datadog AWS integration via the Datadog Integrations page, and then enable CloudFront metrics

You will be able to see the CloudFront metrics via the Datadog Metrics Explorer.

Use CloudFront Logs to Accelerate Troubleshooting

In order to ship CloudFront logs to Datadog, you need to configure the Datadog Forwarder Lambda function, add a trigger, and set up CloudFront as a log source.

Datadog AWS Log Forwarder Configuration

Enable CloudFront access logs (delivered to Amazon S3) to analyze user behavior and troubleshoot issues.

Logs help you observe:

Cache Optimization – Improve cache hit/miss rates to maximize CDN benefits.

Traffic Patterns – Understand who is accessing your content, from where and when.

Performance Issues – Identify regions or requests experiencing high latency.

Error Analysis – Discover why certain requests fail or aren't cached.

Security – Detect suspicious activity or unauthorized access attempts.

Enable Tracing for Code-Level Visibility

Enable tracing tools such as AWS X-Ray or Datadog APM to trace requests across services. This allows you to:

Pinpoint performance bottlenecks
See what’s happening inside your code during a request
Correlate CloudFront performance with backend services
Tracing adds depth to your observability stack and helps you find issues faster.

Bringing It All Together

Combining CloudFront Metrics, Logs, and Traces gives you complete observability of your CDN layer.

What’s Next: Turn Observability into Action

Once you have visibility, use it to continuously improve:

✅ Optimize Cache Hit Rate
Analyze cache behavior (hit/miss ratio). The goal of CloudFront is to serve the majority of requests from the cache, which improves speed and reduces origin load. Monitor trends and assess how new deployments affect caching. Constant observation leads to measurable improvements.

✅ Fine-Tune Cache Configuration
Review and adjust cache TTLs, headers, cookies, and query string settings. Use cache policies and origin request policies for better control and efficiency.

✅ Identify and Resolve Latency Hotspots
Use Origin Latency metrics to detect slow origins or network bottlenecks. Continuously monitor and improve based on findings.

✅ Set Up Alerts
Configure alerts for high error rates (4xx/5xx), increasing latency, or dropping cache hit ratios. Early alerts help resolve issues before they impact users.

✅ Use Geo and Device Insights
Analyze where your traffic comes from and what devices are used. This helps optimize delivery strategies and detect anomalies or unauthorized access.

✅ Correlate Data Across Services
Link CloudFront data with backend services for end-to-end observability. Enabling tracing across services provides a full picture of request flows and system health.

With the right combination of metrics, logs, and traces, you can unlock powerful insights into your CloudFront performance, troubleshoot faster, and continuously improve user experience.