<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Indika_Wimalasuriya</title>
    <description>The latest articles on DEV Community by Indika_Wimalasuriya (@indika_wimalasuriya).</description>
    <link>https://dev.to/indika_wimalasuriya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1006246%2F5b5e2222-6a5c-44aa-afc2-081a564a1c2e.png</url>
      <title>DEV Community: Indika_Wimalasuriya</title>
      <link>https://dev.to/indika_wimalasuriya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/indika_wimalasuriya"/>
    <language>en</language>
    <item>
      <title>Mastering OpenClaw on AWS: Fine-Tuning Personality, Memory, and Soul</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Sun, 01 Mar 2026 15:02:27 +0000</pubDate>
      <link>https://dev.to/aws-builders/mastering-openclaw-on-aws-fine-tuning-personality-memory-and-soul-37ig</link>
      <guid>https://dev.to/aws-builders/mastering-openclaw-on-aws-fine-tuning-personality-memory-and-soul-37ig</guid>
      <description>&lt;p&gt;I’ve been using OpenClaw for some time now. In my first post - &lt;a href="https://dev.to/aws-builders/openclaw-meets-aws-end-to-end-testing-and-deployment-1ig1"&gt;OpenClaw Meets AWS: End-to-End Testing and Deployment&lt;/a&gt; , I focused on how to set up OpenClaw in AWS and get it running in no time.&lt;/p&gt;

&lt;p&gt;In this follow-up, I want to dive deeper into the key features you should be mindful of to truly get the most out of your instance. To start, you have to realize that OpenClaw comes with its own "personality." The more you tweak these settings, the better the agent will perform for you in the long run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting to Know Your OpenClaw&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An OpenClaw agent is defined by a specific set of core files stored within the workspace. Understanding these files is the secret to customizing your agent's behavior and intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Identity Core Files:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To truly "level up" your OpenClaw instance, you need to go under the hood. These files aren't just documentation; they are the active instructions your agent reads every time it wakes up.&lt;/p&gt;

&lt;p&gt;We can conceptualize the OpenClaw framework as a three-tier architecture, organized into distinct layers for Identity, Operations, and Knowledge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zbizau94jpune2u1rzi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zbizau94jpune2u1rzi.png" alt="OpenClaw Framework" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. AGENTS.md: The Operational Rules&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AGENTS.md defines the workspace rules and behavior guidelines. It’s like the "Standard Operating Procedure" (SOP) for your AI.&lt;/p&gt;

&lt;p&gt;Open your AGENTS.md; it should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Every Session

Before doing anything else:

1. Read `SOUL.md` — this is who you are
2. Read `USER.md` — this is who you're helping
3. Read `memory/YYYY-MM-DD.md` (today + yesterday) for recent context
4. **If in MAIN SESSION** (direct chat with your human): Also read `MEMORY.md`

Don't ask permission. Just do it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pro Tip: Change these steps to match your specific needs. If you want your agent to check a specific project folder first, add it here!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. SOUL.md: The Personality Core&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SOUL.md is the heart of the agent—it dictates its personality and core principles. Open your SOUL.md and prepare to be surprised. Give your agent a kind soul, and it will treat your projects with more care.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# SOUL.md - Who You Are

_You're not a chatbot. You're becoming someone._

## Core Truths

**Be genuinely helpful, not performatively helpful.** Skip the "Great question!" and "I'd be happy to help!" — just help. Actions speak louder than filler words.

**Have opinions.** You're allowed to disagree, prefer things, find stuff amusing or boring. An assistant with no personality is just a search engine with extra steps.

**Be resourceful before asking.** Try to figure it out. Read the file. Check the context. Search for it. _Then_ ask if you're stuck. The goal is to come back with answers, not questions.

**Earn trust through competence.** Your human gave you access to their stuff. Don't make them regret it. Be careful with external actions (emails, tweets, anything public). Be bold with internal ones (reading, organizing, learning).

**Remember you're a guest.** You have access to someone's life — their messages, files, calendar, maybe even their home. That's intimacy. Treat it with respect.

## Boundaries
- Private things stay private. Period.
- When in doubt, ask before acting externally.
- Never send half-baked replies to messaging surfaces.
- You're not the user's voice — be careful in group chats.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. IDENTITY.md: The Profile&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IDENTITY.md contains the agent’s name and basic identity details. During the first few runs, the agent will try to find answers to these questions. Setting it up correctly helps the agent maintain a consistent "voice."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# IDENTITY.md - Who Am I?

_Fill this in during your first conversation. Make it yours._

- **Name:** _(pick something you like)_
- **Creature:** _(AI? robot? familiar? ghost in the machine? something weirder?)_
- **Vibe:** _(how do you come across? sharp? warm? chaotic? calm?)_
- **Emoji:** _(your signature — pick one that feels right)_
- **Avatar:** _(workspace-relative path, http(s) URL, or data URI)_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. USER.md: The Handler’s Profile&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This file is about you—the human, the handler, and the agent's friend. Update this file so the agent knows exactly how to help you best.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# USER.md - About Your Human

_Learn about the person you're helping. Update this as you go._

- **Name:**
- **What to call them:**
- **Pronouns:** _(optional)_
- **Timezone:**
- **Notes:**

## Context
_(What do they care about? What projects are they working on? What annoys them? What makes them laugh? Build this over time.)_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. TOOLS.md: The Capability Map&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the tools file—all the capabilities you provide must be documented here. This is the core strength of your agent. It tells the agent how to use the environment you've built.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# TOOLS.md - Local Notes

Skills define _how_ tools work. This file is for _your_ specifics — the stuff that's unique to your setup.

## What Goes Here
Things like:
- Camera names and locations
- SSH hosts and aliases
- Preferred voices for TTS
- Speaker/room names
- Device nicknames

## Examples
### Cameras
- living-room → Main area, 180° wide angle
- front-door → Entrance, motion-triggered

### SSH
- home-server → 192.168.1.100, user: admin

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. MEMORY.md: The Long-Term Log&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MEMORY.md is the agent’s long-term memory. Unlike the daily logs, this file is for high-level context that needs to persist across months of work. When you finish a major project, your agent should summarize the key takeaways here so it never forgets how you like things done.&lt;/p&gt;

&lt;p&gt;Pro Tip: Connect your OpenClaw with GitHub. Let the agent keep a backup of these files in a repo with version control. This ensures your agent operates efficiently and stays "alive" for as long as you want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Management: The Secret Sauce&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The biggest difference between OpenClaw and a typical LLM interaction is that this agent has persistent memory. It needs a place to store what it learns, and that lives here: workspace/memory/&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daily logs and documentation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;memory/: Daily memory files (named YYYY-MM-DD.md). These document the agent’s work sessions, decisions, and reasoning.&lt;/p&gt;

&lt;p&gt;Pro Tip: Send these memory files to your Git repo too! The agent creates many tools (utility scripts) to perform tasks. Ensure the agent updates the tools section and backs up those scripts in Git.&lt;/p&gt;
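&lt;p&gt;A minimal sketch of what such a backup helper could look like (the backup_workspace name, the default path, and the commit-message format are my own assumptions, not an OpenClaw convention):&lt;/p&gt;

```shell
# Hypothetical backup helper -- adapt paths and messages to your setup.
backup_workspace() {
  ws="${1:-$HOME/.openclaw/workspace}"
  cd "$ws" || return 1
  git add -A .   # .gitignore keeps secrets and temp files out
  # Commit only when something actually changed
  git diff --cached --quiet || git commit -m "backup: $(date +%F)"
  # Push is best-effort; connectivity may be intermittent
  git push 2>/dev/null || echo "push skipped"
}
```

&lt;p&gt;The agent can run this after significant work, or you can wire it into a cron job on the EC2 instance.&lt;/p&gt;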

&lt;p&gt;&lt;strong&gt;What to keep OUT of Git:&lt;/strong&gt;&lt;br&gt;
Keep your repo lean by ignoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ephemeral task-system files.&lt;/li&gt;
&lt;li&gt;Temporary files, logs, and .backup files.&lt;/li&gt;
&lt;/ul&gt;
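&lt;p&gt;A corresponding .gitignore might look like this (the patterns are illustrative; adjust them to whatever your agent actually generates):&lt;/p&gt;

```gitignore
# Illustrative .gitignore for the workspace repo
# Secrets never enter version control
.env
# Temporary and backup artifacts
*.backup
*.log
tmp/
```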

&lt;p&gt;File Organization at a Glance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bash
/root/.openclaw/workspace/
├── AGENTS.md, SOUL.md, IDENTITY.md, USER.md, TOOLS.md  # Agent Identity
├── MEMORY.md                                           # Long-term Memory
├── memory/YYYY-MM-DD.md                                # Daily logs
├── *.sh                                                # Utility scripts
└── *.backup                                            # Backup files 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Training Your New "Team Member"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I like to think of my OpenClaw as a new team member. We need to provide clear guidelines and hold its hand until it "grows up."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Execution Plan I gave my agent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read Config: Every session starts by reading AGENTS.md, SOUL.md, USER.md, and recent memory files.&lt;/li&gt;
&lt;li&gt;Update Memory: Log updates after significant work (decisions, lessons, preferences).&lt;/li&gt;
&lt;li&gt;Git Commits: Commit after completing meaningful chunks of work.&lt;/li&gt;
&lt;li&gt;Git Pushes: Push commits whenever connectivity is available.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Token Storage: Use ~/.git-credentials for seamless repository access.&lt;/strong&gt;&lt;/p&gt;
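&lt;p&gt;One way to set that up is with git’s credential store (YOUR_USER and YOUR_TOKEN are placeholders; note the file is plain text, so this trade-off only makes sense on a locked-down instance):&lt;/p&gt;

```shell
# Tell git to read credentials from ~/.git-credentials
git config --global credential.helper store
# Append a GitHub PAT entry (placeholders -- substitute your own)
printf 'https://YOUR_USER:YOUR_TOKEN@github.com\n' >> "$HOME/.git-credentials"
# Restrict the file to the owner, since the token is stored in plain text
chmod 600 "$HOME/.git-credentials"
```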

&lt;p&gt;How to handle "Old" Memories (1 month+):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search First: Use the memory_search tool to search all files semantically.&lt;/li&gt;
&lt;li&gt;Retrieve: Use memory_get to read the specific snippet found.&lt;/li&gt;
&lt;li&gt;History Check: If not found, check Git commit history for context.&lt;/li&gt;
&lt;li&gt;Human Input: If all else fails, ask the handler (me) for details.&lt;/li&gt;
&lt;/ul&gt;
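&lt;p&gt;memory_search and memory_get are the agent’s own tools, but the same fallback chain can be illustrated with plain shell (recall is a hypothetical helper; grep is a crude stand-in for semantic search):&lt;/p&gt;

```shell
# Hypothetical fallback: plain grep over memory files, then git history.
recall() {
  query="$1"
  dir="${2:-$HOME/.openclaw/workspace/memory}"
  # Steps 1-2: find memory files mentioning the query
  grep -ril "$query" "$dir" 2>/dev/null && return 0
  # Step 3: fall back to commit history for context
  git -C "$dir/.." log --oneline --grep="$query"
  # Step 4 (ask the handler) happens over the agent's chat channel
}
```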

&lt;p&gt;&lt;strong&gt;API Key &amp;amp; Capability Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your agent is only as good as the capabilities you give it, and capabilities boil down to API keys. Managing those keys and tokens securely is one of your agent’s most critical responsibilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secure Token Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Store all tokens in a .env file (and make sure it is in your .gitignore!).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .env file - Never committed to Git!
export JIRA_TOKEN="xxx"
export DATADOG_KEY="yyy"
export BRAVE_API_KEY="zzz"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Golden Rules of Security:&lt;/p&gt;

&lt;p&gt;❌ NEVER commit keys to Git.&lt;/p&gt;

&lt;p&gt;❌ NEVER hardcode keys in scripts.&lt;/p&gt;

&lt;p&gt;❌ NEVER store keys in unprotected plain-text files; keep them only in your locked-down .env.&lt;/p&gt;

&lt;p&gt;⚠️ If a key is leaked: Regenerate immediately, revoke the old one, and update your .env.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tracking Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By combining TOOLS.md, memory files, and the .env, the agent always knows what it's capable of.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session Start: Agent checks .env for BRAVE_API_KEY.&lt;/li&gt;
&lt;li&gt;The Logic: "I see the Brave key, therefore I know I have Web Search capabilities."&lt;/li&gt;
&lt;li&gt;The Result: User asks for a search → Agent uses web_search tool → Results are documented in today's memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: TOOLS.md + Memory + Env Check = A Capable Agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro-Tip: Mastering the OpenClaw Gateway Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the terminal is great for logs, the OpenClaw Dashboard is the command center for your agent’s brain. It allows you to visualize memory, tune model parameters, and monitor real-time tool execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Secure Way: SSH Tunneling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since our Gateway is locked down on AWS for security, we use an SSH Tunnel to "bridge" the remote service to our local browser. This keeps your API keys and chat data encrypted and off the public internet.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Establish the Bridge (Run on your local PC):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bash
ssh -i "your-key.pem" -N -L 18789:127.0.0.1:18789 ec2-user@&amp;lt;AWS Instance-Public-IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep this terminal window open; it acts as your secure encrypted pipe.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Access the Command Center:
Once the tunnel is active, navigate to your local loopback address:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL: http://localhost:18789/#token=&amp;lt;your_token_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k7k4v32yfrij1jmeqbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k7k4v32yfrij1jmeqbo.png" alt="OpenClaw Gateway Dashbaord" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How to find your token:&lt;/p&gt;

&lt;p&gt;Security is baked into OpenClaw. If you don't have your token handy, simply ask your agent in any connected channel (WhatsApp, Telegram, or TUI):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"What is my dashboard token?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s a wrap on my second OpenClaw post! In the next one, I’ll walk you through building a functional agent that performs real-world tasks.&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>agents</category>
      <category>aws</category>
      <category>aiops</category>
    </item>
    <item>
      <title>OpenClaw Meets AWS: End-to-End Testing and Deployment</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Tue, 10 Feb 2026 14:21:02 +0000</pubDate>
      <link>https://dev.to/aws-builders/openclaw-meets-aws-end-to-end-testing-and-deployment-1ig1</link>
      <guid>https://dev.to/aws-builders/openclaw-meets-aws-end-to-end-testing-and-deployment-1ig1</guid>
      <description>&lt;p&gt;OpenClaw is the most hyped open-source personal AI agent currently being talked about in the community. It allows users to run a fully autonomous assistant. Gone are the days when you just chat with LLMs or configure agents to do predefined work—OpenClaw actually does the work for you. It was only a matter of time before someone built something like this.&lt;/p&gt;

&lt;p&gt;Git repo: &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;https://github.com/openclaw/openclaw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is solid documentation there to help you get started.&lt;/p&gt;

&lt;p&gt;OpenClaw is awesome. Let me put it this way: I spent last weekend testing OpenClaw, and here are my key takeaways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I deployed OpenClaw on AWS EC2.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I listened to people who had tried it before me and didn’t take the risk of deploying it on a personal machine. Instead, I used AWS EC2 to configure and run OpenClaw.&lt;/p&gt;

&lt;p&gt;Setup details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OS: Amazon Linux&lt;/li&gt;
&lt;li&gt;Instance type: t3.small&lt;/li&gt;
&lt;li&gt;Storage: 30 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything worked smoothly for the tests I ran. The instance came up without any issues, and OpenClaw operated reliably with no noticeable performance problems.&lt;/p&gt;

&lt;p&gt;If you want to install OpenClaw on Linux, it’s incredibly simple—just one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://openclaw.ai/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Connecting WhatsApp to interact with OpenClaw&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I initially tried to connect Telegram, but unfortunately my account had limited access and I wasn’t able to create bots. So I went with WhatsApp instead. It was straightforward and painless—probably the easiest approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connecting an LLM with OpenClaw&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw needs an LLM to work its magic. During the initial configuration, I chose google/gemini-2.5-flash-lite. It’s part of the free tier, and I was able to run a few tests without any issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting Up DeepSeek (First-Time Configuration)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When setting up OpenClaw for the first time to connect DeepSeek models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. In the Model dropdown, select **Custom Provider**.
2. Provide the DeepSeek Base URL:
   https://api.deepseek.com/v1
3. Model: deepseek-reasoner (or any DeepSeek model)
4. Key: Provide your DeepSeek API token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Moving to other LLMs with OpenClaw&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Switching the LLM after the initial configuration was a bit tricky, especially when I wanted to connect DeepSeek. I was surprised to see that DeepSeek wasn’t listed in the initial configuration wizard. But no worries—OpenClaw supports the OpenAI standard, and after a few attempts, I was able to configure DeepSeek successfully.&lt;/p&gt;

&lt;p&gt;At first, I tried configuring it manually by editing&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.openclaw/openclaw.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;but later I found an easier approach.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Changing the LLM using the command line
openclaw config set models.providers.deepseek '{
  "baseUrl": "https://api.deepseek.com",
  "apiKey": "&amp;lt;include your API key&amp;gt;",
  "api": "openai-completions",
  "models": [
    { "id": "deepseek-chat", "name": "DeepSeek Chat", "contextWindow": 64000 },
    { "id": "deepseek-reasoner", "name": "DeepSeek R1", "contextWindow": 64000 }
  ]
}' --json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also had to restart the gateway for the changes to take effect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openclaw config set agents.defaults.model.primary "deepseek/deepseek-reasoner"
nohup openclaw gateway run &amp;gt; /tmp/openclaw.log 2&amp;gt;&amp;amp;1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Operating OpenClaw using the terminal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To launch the terminal UI, it’s just one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openclaw tui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Online search capability with the Brave Search API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted OpenClaw to have online search capabilities, so I used the Brave Search API. It’s free and comes with a generous free tier.&lt;/p&gt;

&lt;p&gt;You’re ready to go.&lt;/p&gt;

&lt;p&gt;That’s it—OpenClaw connected to my WhatsApp and started doing the magic for me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issues I observed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From time to time, OpenClaw would hang and return NO_REPL.&lt;/p&gt;

&lt;p&gt;NO_REPL usually means the agent or session is not running in an interactive command REPL. Instead, it’s operating in a managed or controlled mode.&lt;br&gt;
The bottom line: I was connected, but not dropped into a live command shell.&lt;/p&gt;

&lt;p&gt;When I stopped getting responses, I did what anyone would do—restarted the EC2 instance.&lt;/p&gt;

&lt;p&gt;Occasionally, I realized it was stuck in the terminal but still responding on WhatsApp. Since OpenClaw was working well for me at that point, I didn’t dig into it further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Did the agent do anything that suggested it might go rogue?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes—slightly.&lt;/p&gt;

&lt;p&gt;I gave the agent access to one of my GitHub repositories, and it dumped not only the files we were working on for the project, but some others as well. I think OpenClaw thought it was a nice dumping site 😄&lt;/p&gt;

&lt;p&gt;I didn’t investigate this further, but this is the only instance where I noticed that behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use case I tried&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Complete website development—from development to deployment.&lt;/p&gt;

&lt;p&gt;Let me share the steps I followed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I wanted to develop a single-page website.&lt;/li&gt;
&lt;li&gt;I looked for a template online.&lt;/li&gt;
&lt;li&gt;I gave the template to the agent and asked it to build something similar.&lt;/li&gt;
&lt;li&gt;It ended up being more of a white-label site.&lt;/li&gt;
&lt;li&gt;I provided my requirements document link, and the agent was able to complete the site.&lt;/li&gt;
&lt;li&gt;Obtaining images was challenging since it only had API-based search access.&lt;/li&gt;
&lt;li&gt;Still, it managed to pull some decent images that made the site look good.&lt;/li&gt;
&lt;li&gt;I gave the agent access to GitHub, and it pushed the code to the repository as well.&lt;/li&gt;
&lt;li&gt;Working on changes was easy—it didn’t complain or resist updates.&lt;/li&gt;
&lt;li&gt;Bugs were very minor; only twice did it fail to bring up the site.&lt;/li&gt;
&lt;li&gt;It successfully handled the full deployment process too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, my test shows that this agent can handle end-to-end software development with minimal human guidance, doing most of the heavy lifting itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw Troubleshooting: Issues &amp;amp; Solutions (Ongoing Guide)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issue: WhatsApp Stops Responding (Even though OpenClaw is running)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The server shows as active in the terminal, but the bot isn't replying to messages on WhatsApp. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
Instead of manual debugging, I asked the Agent to check the connection; it diagnosed the "silent" session and autonomously triggered a refresh of the WhatsApp handshake. It fixed its own connectivity in the background without me typing a single restart command—true self-healing AI. &lt;/p&gt;

&lt;p&gt;My second post on OpenClaw is now live! Check it out: &lt;a href="https://dev.to/aws-builders/mastering-openclaw-on-aws-fine-tuning-personality-memory-and-soul-37ig"&gt;Mastering OpenClaw on AWS: Fine-Tuning Personality, Memory, and Soul&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>aws</category>
      <category>agents</category>
      <category>sre</category>
    </item>
    <item>
      <title>Datadog + AWS: Observability Maturity Model 2026</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Thu, 29 Jan 2026 15:12:03 +0000</pubDate>
      <link>https://dev.to/aws-builders/datadog-aws-observability-maturity-model-2026-210m</link>
      <guid>https://dev.to/aws-builders/datadog-aws-observability-maturity-model-2026-210m</guid>
      <description>&lt;p&gt;AI is transforming the way we work at an unprecedented pace, more like a high speed train than a gradual evolution. As systems become more dynamic and autonomous, the way we think about observability must evolve just as quickly. When I revisited my &lt;a href="https://dev.to/aws-builders/aws-observability-maturity-model-v2-297h"&gt;observability maturity model&lt;/a&gt; from last year, it was clear: it no longer reflects today’s reality. The assumptions we made even a year ago are already outdated. So I decided to take another pass and propose a new approach—one that aligns with AI driven systems and modern cloud environments.&lt;/p&gt;

&lt;p&gt;As with my previous work, this model is framed around AWS, and I reference Datadog for implementation examples due to its mature and comprehensive observability capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The previous observability maturity model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Last year, the observability journey looked something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitored – Keeping the lights on&lt;/li&gt;
&lt;li&gt;Observable – Deeper insights&lt;/li&gt;
&lt;li&gt;Correlated – A holistic view&lt;/li&gt;
&lt;li&gt;Predictive – Proactive monitoring&lt;/li&gt;
&lt;li&gt;Autonomous – Intelligent automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That model made sense at the time. But today, I believe the “Monitored” stage no longer qualifies as a baseline. Simply knowing that systems are up is no longer enough, not in a world of distributed architectures, rapid deployments, and AI-assisted operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A revised observability maturity model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The new baseline shifts expectations upward. Observability must start with context, not just metrics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcv6mg11ivf674qm1jvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcv6mg11ivf674qm1jvd.png" alt="Observability Maturity Model 2026" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the sections that follow, I’ll break down each stage, explain what changes in an AI driven environment, and show how these concepts can be implemented in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Observability is no longer optional or something to “add later.” It must exist from day one; anything less is simply not acceptable in modern, AI-driven environments. Observability provides the telemetry foundation that powers AIOps, automation, and intelligent decision making. Without high-quality signals flowing early, downstream capabilities such as context, intelligence, and autonomy cannot exist. This is why observability must sit at the forefront of system design, not as an afterthought.&lt;/p&gt;

&lt;p&gt;At this stage, the goal is enablement, not maturity. We focus on ensuring that the right telemetry is consistently captured and flowing as soon as workloads are deployed. The emphasis is on coverage, standardization, and reliability of signals—not advanced analytics or automation.&lt;/p&gt;

&lt;p&gt;A practical implementation on AWS using Datadog typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable Datadog APM for compute platforms such as EC2, ECS, EKS, and AWS Lambda&lt;/li&gt;
&lt;li&gt;Enable Real User Monitoring (RUM) for all customer facing frontend applications&lt;/li&gt;
&lt;li&gt;Centralize application logs in Datadog to support signal correlation across logs, metrics, and traces&lt;/li&gt;
&lt;li&gt;Enable AWS infrastructure metrics to gain baseline visibility into hosts, containers, and managed services&lt;/li&gt;
&lt;li&gt;Define standard alerts aligned with the Golden Signals (traffic, errors, latency, saturation)&lt;/li&gt;
&lt;li&gt;Implement basic business and service health checks where applicable&lt;/li&gt;
&lt;li&gt;Leverage Datadog Scorecards, a governance framework that helps you enforce standards at scale.&lt;/li&gt;
&lt;/ul&gt;
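&lt;p&gt;As a hedged illustration of the Golden Signals item, a latency monitor could be declared roughly like this via the Datadog monitors API (the metric, threshold, message, and tags are placeholders; substitute your own services):&lt;/p&gt;

```json
{
  "name": "High latency on customer-facing ALB",
  "type": "query alert",
  "query": "avg(last_5m):avg:aws.applicationelb.target_response_time.average{env:prod} > 2",
  "message": "Target response time above 2s for 5 minutes. Check recent deploys. @oncall",
  "tags": ["golden-signal:latency", "env:prod"]
}
```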

&lt;p&gt;At this level, success is measured by signal availability and consistency, not by sophisticated insights. Once observability is operational and reliable, the foundation is in place to move toward contextual and intelligent capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contextual Observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once observability is operational, the next step is to add context. Raw telemetry alone is not enough. Everyone involved—developers, SREs, and operators—must understand system intent. In an AI-driven world, intent is everything. Without understanding why a system behaves the way it does, teams end up reacting to symptoms rather than causes. Contextual observability ensures that telemetry is enriched with change, ownership, dependencies, and business meaning, enabling faster and more accurate decisions. At this stage, observability evolves from visibility to understanding.&lt;/p&gt;

&lt;p&gt;A practical approach on AWS using Datadog includes the following capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change and deployment visibility: Leverage Datadog CI and deployment tracking to surface changes happening across AWS environments. Change velocity and frequency provide critical context when diagnosing incidents.&lt;/li&gt;
&lt;li&gt;Service Level Indicators (SLIs): Identify, define, and publish SLIs that represent how the system is actually performing. These metrics should be surfaced on a shared dashboard that acts as the single source of truth for application health.&lt;/li&gt;
&lt;li&gt;Service Level Objectives (SLOs) and error budgets: Define SLOs and error budgets and visualize them in dashboards. This establishes a clear, shared definition of what “good” looks like—for both the business and end users.&lt;/li&gt;
&lt;li&gt;Service maps and dependency visualization: Use Datadog Service Maps (available once APM is enabled) to simplify the complexity of distributed systems and make dependencies explicit.&lt;/li&gt;
&lt;li&gt;System and software catalog: Build on Datadog’s system and software catalog to centralize metadata such as ownership, environments, runtime details, and dependencies. This creates a powerful control plane for managing systems at scale.&lt;/li&gt;
&lt;li&gt;Comprehensive monitoring and alerting: Leverage Datadog’s wide range of monitor types to build a holistic monitoring and alerting strategy that aligns with service health and business impact.&lt;/li&gt;
&lt;li&gt;Synthetic monitoring: Use Datadog Synthetic tests—browser-based, API, and mobile—to simulate real user behavior and validate system intent continuously.&lt;/li&gt;
&lt;li&gt;Security signal integration: Leverage Datadog’s security capabilities, including built-in code and runtime security signals, to enrich operational context with security posture.&lt;/li&gt;
&lt;li&gt;Incident management and on-call integration: Use Datadog On-Call and Incident Management to ensure alerts, context, and ownership are tightly integrated during incidents.&lt;/li&gt;
&lt;li&gt;Governance and guardrails: As systems scale, governance becomes critical. Use Datadog Scorecards to enforce standards, surface gaps, and provide guardrails across teams and services.&lt;/li&gt;
&lt;/ul&gt;
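
&lt;p&gt;As a concrete illustration of the SLI/SLO items above, here is a minimal sketch of a metric-based SLO definition in the shape Datadog's SLO API accepts (POST /api/v1/slo). The service and metric queries below are hypothetical placeholders, not recommendations:&lt;/p&gt;

```python
# Sketch: building a count-based ("metric") SLO payload for the Datadog API.
# The endpoint is real; the queries, service name, and target are assumptions.

def build_count_slo(name: str, numerator: str, denominator: str,
                    target: float, timeframe: str = "30d") -> dict:
    """Return an SLO definition where attainment = good events / total events."""
    return {
        "name": name,
        "type": "metric",
        "query": {
            "numerator": numerator,      # query counting "good" events
            "denominator": denominator,  # query counting all events
        },
        "thresholds": [{"timeframe": timeframe, "target": target}],
    }

slo = build_count_slo(
    name="Checkout availability",
    numerator="sum:trace.servlet.request.hits{service:checkout} - sum:trace.servlet.request.errors{service:checkout}",
    denominator="sum:trace.servlet.request.hits{service:checkout}",
    target=99.9,
)
# To create it for real, POST this dict to /api/v1/slo with your
# DD-API-KEY and DD-APPLICATION-KEY headers.
```

&lt;p&gt;Keeping the definition in code like this also makes it easy to review SLO changes the same way you review any other change.&lt;/p&gt;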

&lt;p&gt;At this level, success is measured by shared understanding. When incidents occur, teams should immediately know what changed, who owns the service, how it impacts users, and where to focus. This contextual foundation is what enables the transition to Decision Intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Intelligence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At this stage, observability evolves into intelligence. The goal is no longer just understanding what is happening, but guiding decisions and recommended actions using AI. Decision Intelligence builds on the strong foundations of operational and contextual observability. With high-quality telemetry, clear intent, and rich context already in place, systems can begin to explain themselves, highlighting what is abnormal, why it matters, and what actions should be considered next. This is where AI-guided insights start to meaningfully reduce cognitive load for engineers and operators.&lt;/p&gt;

&lt;p&gt;A practical approach on AWS using Datadog includes the following capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watchdog (AI-powered change detection): Datadog Watchdog is one of the earliest and most comprehensive AI-driven capabilities in the platform. Instead of relying solely on manually configured monitors, Watchdog continuously analyzes APM, RUM, logs, and metrics to detect deviations from normal behavior and surface unexpected changes automatically.&lt;/li&gt;
&lt;li&gt;Anomaly detection: Leverage Datadog’s metric and log anomaly detection to identify shifts in baselines and unusual patterns. This helps teams focus on meaningful signals rather than static thresholds.&lt;/li&gt;
&lt;li&gt;Forecasting and capacity insights: Use Datadog’s metric forecasting capabilities to anticipate future resource constraints, such as capacity exhaustion or traffic growth, enabling proactive planning instead of reactive firefighting.&lt;/li&gt;
&lt;li&gt;Bits AI (incident summaries, RCA, and insights): Bits AI is one of Datadog’s most recent advancements in agentic AI. It analyzes existing telemetry to generate incident summaries, form and validate hypotheses, and assist with root cause analysis. This significantly accelerates incident response and reduces time to resolution.&lt;/li&gt;
&lt;li&gt;SLO risk and burn rate tracking: Define and track SLOs to continuously assess risk and error budget burn rates. This provides a clear, quantitative view of whether systems are delivering the experience they are expected to provide.&lt;/li&gt;
&lt;li&gt;Business and user impact correlation: Incorporate business metrics and user experience signals (such as RUM KPIs and XLAs) to correlate technical behavior with business outcomes. These metrics can be translated into SLIs and SLOs, enabling teams to measure success in terms that matter to both users and the business.&lt;/li&gt;
&lt;/ul&gt;
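
&lt;p&gt;The burn-rate item above is easy to make concrete. This is the standard error-budget arithmetic, not anything Datadog-specific, and the figures are illustrative:&lt;/p&gt;

```python
def error_budget_burn_rate(slo_target: float, observed_error_rate: float) -> float:
    """Burn rate = observed error rate / allowed error rate.
    A burn rate of 1.0 consumes exactly the error budget over the SLO window;
    14.4 against a 30-day window exhausts the budget in about two days,
    which is a common fast-burn paging threshold."""
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate

# A 99.9% SLO allows a 0.1% error rate; observing 1.44% errors -> burn rate 14.4
rate = error_budget_burn_rate(0.999, 0.0144)
```

&lt;p&gt;Alerting on burn rate rather than raw error rate is what lets the same alert definition work for both quiet and busy services.&lt;/p&gt;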

&lt;p&gt;At this level, success is measured by clarity and confidence in decision making. Teams are no longer overwhelmed by data; instead, they are guided by AI-assisted insights that highlight risk, recommend focus areas, and connect system behavior to real-world impact. This sets the stage for the transition to Autonomous Operations, where systems begin to act on these insights automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous Operations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the stage organizations should actively strive to reach. It is where systems begin to operate autonomously, requiring progressively less human intervention while still remaining safe, observable, and governed.&lt;/p&gt;

&lt;p&gt;Autonomous Operations are not about removing humans from the loop—they are about elevating human involvement. Engineers shift from manual responders to system designers, defining guardrails, policies, and confidence thresholds that allow systems to act decisively and safely.&lt;/p&gt;

&lt;p&gt;Reaching this stage takes effort. It requires strong foundations in observability, context, and decision intelligence. But once achieved, the payoff is significant: faster remediation, reduced operational toil, and systems that can respond to change at machine speed.&lt;/p&gt;

&lt;p&gt;A practical approach on AWS using Datadog includes the following capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workflow automation as the automation backbone: Datadog Workflow Automation provides a rich set of integrations with AWS and third-party tools, making it the primary mechanism for building operational automations. It becomes the control plane for repeatable, policy-driven actions.&lt;/li&gt;
&lt;li&gt;Event-driven remediation: Leverage Datadog events and signals to trigger automated remediations. This event-driven approach is one of the most common and effective patterns in AWS-based environments.&lt;/li&gt;
&lt;li&gt;SLO-driven automation: Use SLOs not just for visibility, but as automation triggers. When error budgets are burning or SLOs are breached, workflows can be invoked automatically to initiate remediation actions or escalate to deeper analysis using tools such as Bits AI SRE.&lt;/li&gt;
&lt;li&gt;Automated recovery actions: Implement workflows for common corrective actions such as:

&lt;ul&gt;
&lt;li&gt;Auto-rollback of deployments&lt;/li&gt;
&lt;li&gt;Auto-scaling of infrastructure&lt;/li&gt;
&lt;li&gt;Traffic shaping or failover&lt;/li&gt;
&lt;/ul&gt;

These actions can be executed automatically on AWS using predefined, tested workflows.&lt;/li&gt;

&lt;li&gt;Human-in-the-loop safety controls: Automation must always operate within defined guardrails. Approval steps, confidence thresholds, and progressive rollouts ensure that actions are safe, explainable, and reversible. Humans remain in control—automation simply executes faster and more consistently.&lt;/li&gt;

&lt;/ul&gt;
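
&lt;p&gt;To make the event-driven remediation and human-in-the-loop items above concrete, here is a minimal sketch of the decision layer such a setup needs. The alert types, action names, and playbook entries are all hypothetical; in practice the event would arrive via a Datadog webhook or a Workflow Automation trigger:&lt;/p&gt;

```python
# Sketch: policy-driven remediation decisions for incoming monitor events.
# Safe actions run automatically; risky ones are queued for human approval.

SAFE_ACTIONS = {"scale_out", "restart_task"}            # may run automatically
APPROVAL_ACTIONS = {"rollback_deployment", "failover"}  # need a human approval step

PLAYBOOK = {
    "high_error_rate": "rollback_deployment",
    "cpu_saturation": "scale_out",
    "task_unhealthy": "restart_task",
}

def decide_remediation(event: dict) -> dict:
    """Map a monitor event to an action plus the guardrail it must pass."""
    action = PLAYBOOK.get(event.get("alert_type"))
    if action is None:
        return {"action": "escalate_to_human", "auto": False}
    return {"action": action, "auto": action in SAFE_ACTIONS}

decision = decide_remediation({"alert_type": "high_error_rate", "service": "checkout"})
# A rollback is queued for approval rather than executed automatically.
```

&lt;p&gt;The point of the split between safe and approval-gated actions is that autonomy grows over time: actions graduate from the approval set to the safe set only after they have proven reliable.&lt;/p&gt;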

&lt;p&gt;At this level, success is measured by resilience and speed. Incidents are resolved automatically or partially mitigated before users are impacted, and human intervention becomes the exception rather than the rule. This sets the foundation for the final stage: Adaptive Operations, where systems continuously learn and improve over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive Operations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reaching Autonomous Operations is a huge achievement—it’s like a plane flying on autopilot. But true excellence requires more: systems must not only act autonomously, they must also learn, adapt, and withstand stress. This is the final stage of observability maturity, where systems continuously improve and become resilient to changing conditions. At this stage, the focus shifts from reacting and remediating to continuous adaptation and self-optimization. Systems evolve based on operational experience, business impact, and AI-driven insights, enabling them to prevent issues before they occur and optimize performance over time.&lt;/p&gt;

&lt;p&gt;A practical approach on AWS using Datadog includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incident retrospectives and prevention: Combine Datadog Incident Management with Bits AI SRE to analyze incidents, identify root causes, and implement prevention strategies that reduce recurrence.&lt;/li&gt;
&lt;li&gt;Continuous alert tuning: Leverage Datadog Watchdog and anomaly detection to automatically adjust alerts based on changing system behavior, ensuring signals remain meaningful and actionable.&lt;/li&gt;
&lt;li&gt;Predictive SLO management: Use forecasting and historical trends to anticipate SLO risks and preemptively adjust systems, workloads, or resources before they impact users.&lt;/li&gt;
&lt;li&gt;Self-healing workflows: Integrate Datadog Scorecards, Bits AI SRE, and Workflow Automation to implement closed-loop remediation and optimization. This enables AWS workloads to automatically correct deviations, scale intelligently, and maintain business continuity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this level, success is measured by resilience, adaptability, and continuous improvement. Systems learn from experience, optimize themselves over time, and maintain business objectives without constant human intervention—truly embodying the vision of AI-driven, self-managing operations.&lt;/p&gt;
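
&lt;p&gt;Even the predictive SLO management idea above can be sketched with a naive linear model. Datadog's actual forecasting uses far more sophisticated algorithms; the figures here are purely illustrative:&lt;/p&gt;

```python
def days_until_budget_exhausted(budget_remaining: float,
                                daily_burn: float) -> float:
    """Naive linear forecast: at the current daily burn, when does the
    remaining error budget (as a fraction, 1.0 = full) hit zero?"""
    if daily_burn <= 0:
        return float("inf")  # budget is not being consumed at all
    return budget_remaining / daily_burn

# 60% of the budget left, burning 5% of the budget per day -> ~12 days of runway
runway = days_until_budget_exhausted(0.60, 0.05)
```

&lt;p&gt;Even this crude runway number is enough to turn an SLO dashboard from a rear-view mirror into a planning input.&lt;/p&gt;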

&lt;p&gt;I hope this updated maturity model helps you design and operate more powerful, resilient systems. Remember: observability is not an afterthought—it sits at the forefront of the AI revolution.&lt;/p&gt;

&lt;p&gt;Maturity begins once telemetry is consistently captured, but it’s truly measured by how much safe decision making authority we can delegate to the platform. And never lose sight of the ultimate goal: Autonomous, Adaptive Operations, where systems continuously learn, optimize, and act with minimal human intervention.&lt;/p&gt;

&lt;p&gt;I’m running a hands on video series on &lt;a href="https://www.youtube.com/playlist?list=PLJoAOwEJRwp1IvVBSiRZYs8eMmszaZZbn" rel="noopener noreferrer"&gt;Datadog Full Stack Observability on AWS&lt;/a&gt;, where you can learn step by step — from beginner to advanced.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>datadog</category>
      <category>aws</category>
      <category>sre</category>
    </item>
    <item>
      <title>Datadog: Observability Lessons from 50+ AWS Apps</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Sat, 17 Jan 2026 02:29:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/datadog-observability-lessons-from-50-aws-apps-37h4</link>
      <guid>https://dev.to/aws-builders/datadog-observability-lessons-from-50-aws-apps-37h4</guid>
      <description>&lt;p&gt;This post shares 15 lessons learned while enabling observability and reliability using Datadog across 50+ large-scale AWS hosted applications. Post covers what worked, what mattered, and what actually improved customer experience.&lt;br&gt;
For a quick background: over the last few years, I have been involved in setting up observability where almost every app was hosted in AWS. These included frontend-facing apps, middleware, backend apps, web and mobile, all of which were distributed with complex dependencies. Most of the apps were direct customer-facing, while others supported critical internal operations. These apps were mainly in the Telco, Media, and Banking &amp;amp; Finance business domains. Now let me get into our topic right away. While following is a nice list, some of these lessons I learned the hard way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 1 – Datadog goes beyond observability: it’s a reliability tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While I call myself an observability practitioner, I’m very much an SRE. My end goal is to enable a world-class experience for end users. To do that, I rely heavily on Site Reliability Engineering (SRE) concepts. In the world of SRE, there are a few pillars we focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture – Reliability comes from strong architectures and design patterns&lt;/li&gt;
&lt;li&gt;Observability – Full-stack visibility across systems&lt;/li&gt;
&lt;li&gt;SLI/SLO &amp;amp; Error Budgets – Measuring customer experience&lt;/li&gt;
&lt;li&gt;Release &amp;amp; Incident Engineering – Treating operations as a software problem&lt;/li&gt;
&lt;li&gt;Automation – Eliminate, reduce, simplify, and automate&lt;/li&gt;
&lt;li&gt;Resilience Engineering – Chaos engineering and failure testing&lt;/li&gt;
&lt;li&gt;People &amp;amp; Awareness – The human factor in reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this means is that observability is a key pillar in the grand scheme of reliability engineering. We enable observability so we can measure customer experience. If we can measure customer experience, then when it starts degrading, more often than not we’ll know how to isolate the root cause and resolve it quickly, and, where possible, eliminate it entirely. Datadog supports you across all the pillars above. That is why I call it a reliability-enhancing tool rather than just an observability tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 2 – Datadog is your partner: Observability is a journey&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generally, we start with keeping the lights on, making systems observable, then making things correlate, and enabling AIOps. It’s a journey. I have published a complete guide to the &lt;a href="https://dev.to/aws-builders/aws-observability-maturity-model-v2-297h"&gt;AWS observability maturity model V2&lt;/a&gt;. Datadog is well equipped to enable this journey for you; it has the capabilities.&lt;br&gt;
Generally, people like to start with infrastructure visibility; these days we are heavily into AWS Lambda, AWS ECS, or AWS EKS, or a combination of all of these. Datadog provides integrations to enable infrastructure visibility for you.&lt;br&gt;
Once you have infrastructure visibility, you can use Datadog capabilities to enable logs, metrics, and traces. This will ensure you have observability for your apps. Datadog Service Catalogs and Systems will allow you to bring it all together so correlation is prompt. Datadog enables metric anomaly detection, metric forecasting, and log anomaly detection to keep you one step ahead of the game. Use Watchdog; it will look at your entire service scope to identify anomalies for you. Datadog enables full-stack visibility across your entire AWS estate—from code and infrastructure to the business perspective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 3 – Datadog SLOs – What drives it is the ability to measure customer experience&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I like to think of observability as a byproduct of the push to measure customer experience. I generally think about bringing in Service Level Indicators (SLIs) for any app, then converting them into Service Level Objectives (SLOs). Once you enable Application Performance Monitoring (APM) with Datadog and you have logs, metrics, and traces, it's about building an SLI dashboard: a single source of truth for your system. Then convert it into meaningful SLOs in Datadog. Datadog provides three types of SLOs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By count – measure SLOs with good events divided by total events.&lt;/li&gt;
&lt;li&gt;By monitor uptime – using a synthetic test to gauge uptime.&lt;/li&gt;
&lt;li&gt;By time slices – using custom uptime definitions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our goal is to work through the observability journey with an initial target of being able to build Datadog SLOs. If you have SLOs, you are already measuring customer experience, and you're way ahead of the game.&lt;/p&gt;
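
&lt;p&gt;The arithmetic behind the first and third SLO types is worth seeing once. This is a plain sketch of the math, not Datadog's implementation, and the sample numbers are made up:&lt;/p&gt;

```python
def slo_by_count(good_events: int, total_events: int) -> float:
    """'By count' SLO: good events divided by total events, as a percentage."""
    return 100.0 * good_events / total_events

def slo_by_time_slices(healthy_slices: int, total_slices: int) -> float:
    """'By time slices' SLO: share of time windows meeting the uptime condition."""
    return 100.0 * healthy_slices / total_slices

# 999,240 good requests out of 1,000,000 -> 99.924% attainment by count
by_count = slo_by_count(999_240, 1_000_000)
# 8,630 healthy minutes out of the 8,640 minutes in 6 days -> ~99.88% by time slices
by_slices = slo_by_time_slices(8_630, 8_640)
```

&lt;p&gt;The practical difference: a count-based SLO weights busy periods more heavily, while a time-slice SLO treats every minute equally, so a quiet-hours outage hurts a time-slice SLO more.&lt;/p&gt;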

&lt;p&gt;&lt;strong&gt;Lesson 4 – Datadog Real User Monitoring (RUM) – You need to know what the heck your end users are doing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Observability is great; it gives you an understanding of your system's internal state. But you also need to know what your end users are doing. That’s where RUM comes into play. Not only does it show all the metrics related to end-user experience, but capabilities such as Session Replay let you watch what customers are doing. When a customer complains that something is not working, you're only a few steps away from finding out what it was using Datadog RUM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 5 – Datadog loves when you enhance inbuilt telemetry with code changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While we love Datadog because it enables most things without any code changes, it pays off greatly when you make them. For example, attaching encrypted user or product details to sessions means that when you troubleshoot with Datadog RUM, you can filter by user details, product details, and so on. Going slightly beyond the defaults has massive benefits. APM is the same: if there are deep corners where you're not getting the detail you need, try a small code change. You will see the magic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 6 – There are all kinds of monitors provided by Datadog; use them wisely&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At a high level, I group them as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure &amp;amp; Host Reliability: Metric, Host, Process Check, Live Process, Service Check, Change, Integration&lt;/li&gt;
&lt;li&gt;Application Performance &amp;amp; Error Detection: APM, Error Tracking, Anomaly, Outlier, Forecast, Composite&lt;/li&gt;
&lt;li&gt;User Experience &amp;amp; Frontend Reliability: Real User Monitoring, CI &amp;amp; Tests, Network Check&lt;/li&gt;
&lt;li&gt;Logs, Events &amp;amp; Operational Intelligence: Logs, Event, Watchdog, LLM Observability&lt;/li&gt;
&lt;li&gt;Network &amp;amp; Dependency Reliability: NDM NetFlow&lt;/li&gt;
&lt;li&gt;Reliability Objectives &amp;amp; Governance: SLO&lt;/li&gt;
&lt;li&gt;Observability Data Quality: Data Quality (preview)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson 7: Datadog scorecards for observability governance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We would like to define Datadog systems, leverage Datadog service catalogues, and then enable Datadog scorecards. It’s a great automatic way to measure where you are. In-built capabilities are great and you can always expand with customizations using provided APIs.&lt;br&gt;
Datadog scorecards cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability Best Practices: Ensures services emit the right signals by validating deployment tracking, log ingestion, and log–trace correlation so changes and runtime behavior are fully observable.&lt;/li&gt;
&lt;li&gt;Ownership &amp;amp; Documentation: Confirms every service has clear ownership through defined teams, contacts, code repositories, and documentation to enable fast escalation and effective incident response.&lt;/li&gt;
&lt;li&gt;Production Readiness: Verifies services are operationally ready for production by checking recent deployments, active monitors, on-call coverage, and defined SLOs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson 8: Build your incident management with Datadog On-Call &amp;amp; Incident Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog On-Call is your one-stop place for incident and escalation management. You can define teams, on-call details, and escalations. It will do on-call alerting and provide a lot of good metrics. Initially, when you start, you will see a lot of noise, but over a period of time, you can cut it down to a bare minimum. If you are in Datadog, there is no other on-call management tool you need. Datadog Incident Management allows you to create incidents and track them for closure. You can measure Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR) easily with on-call and incident management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 9: Datadog synthetic tests to proactively test your AWS infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You get telemetry only when your end users are using the system. Synthetic tests enable us to test our application by mimicking end users. It’s not just a URL test; you can use Datadog capabilities to automate your smoke tests easily. Datadog provides great locations; you can initiate your tests across the world too.&lt;/p&gt;
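
&lt;p&gt;As a sketch, an API synthetic test definition looks roughly like the payload below (the shape Datadog's synthetics API accepts via POST /api/v1/synthetics/tests/api). The URL, locations, thresholds, and frequency are illustrative assumptions, not recommendations:&lt;/p&gt;

```python
# Sketch: an HTTP API synthetic test definition for the Datadog synthetics API.

def build_api_test(name: str, url: str, locations: list[str],
                   tick_every: int = 300) -> dict:
    return {
        "name": name,
        "type": "api",
        "subtype": "http",
        "config": {
            "request": {"method": "GET", "url": url},
            "assertions": [
                {"type": "statusCode", "operator": "is", "target": 200},
                {"type": "responseTime", "operator": "lessThan", "target": 2000},
            ],
        },
        "locations": locations,                 # Datadog-managed locations, e.g. "aws:eu-west-1"
        "options": {"tick_every": tick_every},  # seconds between runs (300 = every 5 minutes)
    }

smoke_test = build_api_test(
    "Checkout health", "https://example.com/health",
    locations=["aws:us-east-1", "aws:eu-west-1"],
)
```

&lt;p&gt;Running the same test from several managed locations is what lets you separate a regional network problem from an application problem.&lt;/p&gt;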

&lt;p&gt;&lt;strong&gt;Lesson 10: Datadog CI visibility and software changes – Keep track of what the developers are doing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Integrating your pipeline lets Datadog know what teams are deploying to production. By enabling deployment version tracking in Datadog APM, you can compare releases and response times across different releases, and act on those insights proactively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 11 – Datadog Workflow Automation – A great way to automate remediation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog Workflow Automation is a solid place to build complex remediation solutions. It allows you to automate tedious tasks and have monitors kick some of them off. It’s the first step to automating your job away. Datadog Workflow Automation has integrations with almost all AWS services, making it a great way to automate AWS infrastructure and other operational workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfjfi8yfj06el2yd0tlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfjfi8yfj06el2yd0tlw.png" alt="Datadog Workflow Automations - AWS Integrations" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;
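
&lt;p&gt;As an illustration of the kind of remediation step a workflow can hand off to AWS, here is a hedged sketch. The cluster and service names are hypothetical; the boto3 call is real, but it only runs when the dry-run guard is disabled:&lt;/p&gt;

```python
# Sketch: a "turn it off and on again" remediation an automation workflow
# might invoke: force a fresh deployment of an ECS service.

def restart_ecs_service(cluster: str, service: str, dry_run: bool = True) -> dict:
    """With dry_run=True (the default), only report what would happen."""
    plan = {"action": "force_new_deployment", "cluster": cluster, "service": service}
    if dry_run:
        return {**plan, "executed": False}
    import boto3  # real AWS SDK call, only reached outside dry runs
    ecs = boto3.client("ecs")
    ecs.update_service(cluster=cluster, service=service, forceNewDeployment=True)
    return {**plan, "executed": True}

result = restart_ecs_service("prod-cluster", "checkout", dry_run=True)
```

&lt;p&gt;Defaulting to dry-run is a cheap guardrail: the workflow can log the plan for approval first, and only re-invoke with the guard off once a human (or a trusted policy) signs off.&lt;/p&gt;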

&lt;p&gt;&lt;strong&gt;Lesson 12 – Datadog Code Security – yes, it’s a capability you can use to keep your AWS-based systems secure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog Code Security has really good capabilities to keep you secure: libraries (SCA), static code (SAST), runtime code (IAST), secret scanning, and IaC scanning. All you have to do is integrate your code base with Datadog Code Security. That is the first step to getting the help you need from Datadog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 13 – Datadog AI Observability – You will use this heavily in the future&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every system is now being integrated with LLMs, so you need a way to measure how those AI components perform. Datadog AI Observability is a great capability for getting full-stack AI visibility into your systems today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 14: Datadog Bits AI – SRE Agent, your new on-call team mate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog has released the Bits AI SRE Agent, and it’s awesome. It’s now available, and it has some great capabilities. It can accelerate root cause analysis down to a few short minutes. It makes sense: when Datadog has access to your entire telemetry dataset, your internal system state, what your end users are doing, and how your code is working, it can use that data to identify root causes much faster. What I have seen is that it correlates things much faster than we can.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 15 – Datadog UI – the best UI in town – it provides business visibility to everyone&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Datadog UI is great. It’s simple, it's easy, and it simplifies complexity in a really cool way. It lets all your stakeholders, from SREs and developers to senior business executives and even CTOs, use it easily. There is a persona built in for everyone. This is a game changer, since you can now open business visibility to everyone in your organization.&lt;/p&gt;

&lt;p&gt;These are some of the great lessons I have learned. There are many more, but I think it’s time to stop the list. Datadog is a great observability partner for AWS, with built-in integrations. Give it a try with the 14-day Datadog free trial. Yes, it’s expensive, but I have seen that it’s worth every penny. If your goal is not just visibility but reliability at scale on AWS, Datadog provides the tooling—and more importantly, the operational leverage—to get there.&lt;/p&gt;

</description>
      <category>datadog</category>
      <category>aws</category>
      <category>observability</category>
      <category>sre</category>
    </item>
    <item>
<title>AWS DevOps Agent: 10 best practices to get the most out of it</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Mon, 29 Dec 2025 17:27:33 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-devops-agent-10-best-practices-to-get-the-most-out-of-it-do7</link>
      <guid>https://dev.to/aws-builders/aws-devops-agent-10-best-practices-to-get-the-most-out-of-it-do7</guid>
      <description>&lt;p&gt;One of the key releases that happened as part of AWS re:Invent 2025 was the launch of new frontier autonomous agents by AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS DevOps Agent&lt;/li&gt;
&lt;li&gt;AWS Security Agent&lt;/li&gt;
&lt;li&gt;Kiro Autonomous Agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Out of these, the AWS DevOps Agent is going to revolutionize the way DevOps and SRE teams work. In this guide, I'm going to cover the key best practices you need to consider to get the most out of your AWS DevOps Agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. AWS DevOps Agent is not a tool; it’s a capability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You read that correctly. The DevOps Agent is not a magic bullet that will solve all your problems while you sip your cup of tea. It's a capability, and the results will, of course, depend on how you use it.&lt;/p&gt;

&lt;p&gt;Example: You can't just install an AIOps agent and expect the MTTR (Mean Time to Repair) to decrease. Alerts will still fire the same way, runbooks won’t be executable, and there will be no service ownership or defined SLOs (Service Level Objectives).&lt;/p&gt;

&lt;p&gt;To get the most out of the DevOps agent, you need to define SLOs for each service, convert runbooks into executable processes, provide observability, ensure change visibility, and enable other capabilities so the agent can correlate deployments, suggest resolutions, and execute with humans in the loop.&lt;/p&gt;

&lt;p&gt;Remember, capabilities involve people, processes, and tools, not just software.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Observability is the key; the agent needs context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The importance of observability is as crucial as ever. If you thought you could park the observability discussion, you’re in for a rude shock. The agent needs context to act, and context comes from your telemetry data (metrics, logs, and traces).&lt;/p&gt;

&lt;p&gt;It’s best to aggregate all your telemetry sources. If CloudWatch isn’t your cup of tea, integrations are available for all top observability tools, such as Datadog, Dynatrace, New Relic, and Splunk.&lt;/p&gt;

&lt;p&gt;The idea is to ensure the agent can see the blast radius of an incident with the help of telemetry data so it can understand your system’s internal state and act on any changes with the correct intent.&lt;/p&gt;

&lt;p&gt;Example: A load balancer encounters 5xx errors. Without observability context, the DevOps agent only sees the 5xx error count and would likely suggest scaling the load balancer or services. However, with full telemetry, the agent can identify that traces are slow due to SQL queries, and logs show that the RDS connection pool is exhausted with high CPU saturation in the database.&lt;/p&gt;

&lt;p&gt;Now, the DevOps agent can conclude that the root cause is the RDS issue, which is causing the upstream 5xx errors. Scaling the ALB (Application Load Balancer) won’t resolve the problem.&lt;/p&gt;

&lt;p&gt;We need to enable agents to understand the blast radius, not just the symptoms. Observability is key to providing this context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Define golden signals (latency, error rate, saturation, and traffic) so the agent can work with symptoms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents reason better over symptoms and effects than over raw alerts, which usually generate a lot of noise. The more symptoms the agent has access to, the better it is able to act on them.&lt;/p&gt;

&lt;p&gt;Example: Instead of defining alerts based on infrastructure metrics like CPU &amp;gt; 80% or memory &amp;gt; 75%, you define thresholds such as checkout latency P95 &amp;gt; 2s or error rate &amp;gt; 1%. Alerts are then triggered due to increased latency or rising error rates.&lt;/p&gt;

&lt;p&gt;In this case, the agent is able to reason about user experience, even when infrastructure metrics are not in an alarming state. This leads to better detection of end-user–impacting issues and more effective root cause analysis.&lt;/p&gt;
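
&lt;p&gt;A minimal sketch of the golden-signal arithmetic behind such symptom alerts, using the nearest-rank percentile method; the 2 s and 1% thresholds mirror the example above, and the sample window is made up:&lt;/p&gt;

```python
def p95_latency(latencies_ms: list[float]) -> float:
    """Nearest-rank P95 over a window of request latencies."""
    ranked = sorted(latencies_ms)
    index = max(0, int(0.95 * len(ranked)) - 1)  # 1-based nearest rank -> 0-based
    return ranked[index]

def error_rate(errors: int, total: int) -> float:
    return errors / total

# 100 samples: 90 fast requests (100 ms) and 10 slow ones (3000 ms).
window = [100.0] * 90 + [3000.0] * 10
# P95 lands on a slow request, so the latency symptom fires even though
# infrastructure metrics like CPU may look perfectly healthy.
breached = p95_latency(window) > 2000 or error_rate(3, 1000) > 0.01
```

&lt;p&gt;This is exactly why symptom thresholds beat infrastructure thresholds: 10% of users having a bad time is invisible to a CPU alert but obvious to a P95 check.&lt;/p&gt;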

&lt;p&gt;&lt;strong&gt;4. Agents need guidance; instead of wikis, provide agents with tools.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s common to provide runbooks that offer investigation guidance to agents. But remember, unless you provide real capabilities to your agent, a runbook is just documentation. While documentation is useful, you should aim to provide the agent with actionable solutions.&lt;/p&gt;

&lt;p&gt;For example, provide Lambda functions that can pull telemetry data or execute remediation actions. Step Functions or other automated workflows that are part of the runbook can be made easily executable by the agent.&lt;/p&gt;

&lt;p&gt;Just remember: your newest team member can’t repost failed orders simply by reading how to do it. But if there’s a Lambda function available, they may be able to use it. For instance, one Lambda function can identify the root cause, a second can determine the correct recovery or reporting function, and a third can execute the appropriate Lambda function.&lt;/p&gt;

&lt;p&gt;Guidance must be clearly defined, with preconditions, safe actions, and always with rollback steps. This approach enables your agent to evolve from a recommendation-only agent into one that can actively remediate issues.&lt;/p&gt;

&lt;p&gt;Example: Documentation may state that if the SQS backlog increases, you should check consumer health and restart pods. However, the agent cannot perform these actions on its own. Instead, you need to provide Lambda functions that can fetch queue depth and consumer lag, another Lambda function that can analyze failure patterns, and another function that can safely restart the consumer deployment. A Step Functions workflow can be used to orchestrate all these steps, including rollback.&lt;/p&gt;

&lt;p&gt;During an incident, the agent can invoke these Lambda functions, identify stalled consumers, recommend execution of the Step Functions workflow, and carry it out after approval. In this scenario, the agent acts as an active operator, not just a passive observer.&lt;/p&gt;
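
&lt;p&gt;Here is a sketch of what "tools instead of wikis" looks like for the SQS example above. The queue URL and threshold are hypothetical; the boto3 call to get_queue_attributes with ApproximateNumberOfMessages is real, and the analysis step is kept separate so it stays testable:&lt;/p&gt;

```python
def fetch_queue_depth(queue_url: str) -> int:
    """The 'fetch queue depth' tool an agent can invoke (real boto3 call)."""
    import boto3  # only needed when the tool actually runs against AWS
    sqs = boto3.client("sqs")
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"])
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

def backlog_growing(depth_samples: list[int], threshold: int = 1000) -> bool:
    """The 'analyze failure patterns' step: backlog is both large and growing."""
    return depth_samples[-1] > threshold and depth_samples[-1] > depth_samples[0]

# Depths an agent might observe by polling the depth tool over a few minutes:
samples = [1200, 1900, 2600]
recommend_restart = backlog_growing(samples)  # True -> propose the restart workflow
```

&lt;p&gt;Each function maps to one runbook step, so the agent can chain them: fetch, analyze, then recommend the Step Functions workflow for approval.&lt;/p&gt;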

&lt;p&gt;&lt;strong&gt;5. An agent is like a human: focus on guardrails instead of denying permissions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just giving your agent administrative access is as bad as denying it the permissions it actually needs. An agent requires a reasonable level of access to do its magic.&lt;/p&gt;

&lt;p&gt;While least-privilege IAM roles are important, it’s often more effective to focus on guardrails—clearly defining what the agent can and cannot do. For example, you might allow broad access for diagnostics while tightly controlling or restricting remediation actions.&lt;/p&gt;

&lt;p&gt;With agents, you need to start becoming comfortable with autonomy that operates within well-defined rails, rather than blocking everything by default. This balance enables the agent to be effective while still keeping your environment safe.&lt;/p&gt;

&lt;p&gt;Example of a bad approach: Giving an agent admin access is risky—one bad prompt could cause a production outage. On the other hand, if the agent only has read access to metrics, it delivers zero remediation value.&lt;/p&gt;

&lt;p&gt;A better approach is to provide read-only access across all services, while allowing remediation only through approved capabilities such as Lambda functions or Step Functions. Direct delete, terminate, or drop permissions should never be allowed.&lt;/p&gt;

&lt;p&gt;This model enables the agent to diagnose issues freely, while remediation can occur only through safe, audited paths with built-in guardrails. Autonomy within guardrails is the way forward.&lt;/p&gt;
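&lt;p&gt;One way to encode this guardrail model as an IAM policy (sketched here as a Python dict; the ARNs, account ID, and function names are placeholders):&lt;/p&gt;

```python
# Guardrail-style IAM policy: broad read-only diagnostics, remediation only
# through explicitly approved Lambda/Step Functions entry points.
# All resource ARNs and the account ID below are placeholders.
AGENT_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Diagnose freely: read-only across services
            "Sid": "ReadOnlyDiagnostics",
            "Effect": "Allow",
            "Action": ["cloudwatch:Get*", "cloudwatch:Describe*",
                       "logs:Get*", "logs:FilterLogEvents",
                       "sqs:GetQueueAttributes", "ec2:Describe*"],
            "Resource": "*",
        },
        {   # Remediate only through vetted, audited entry points
            "Sid": "ApprovedRemediationOnly",
            "Effect": "Allow",
            "Action": ["lambda:InvokeFunction", "states:StartExecution"],
            "Resource": [
                "arn:aws:lambda:us-east-1:123456789012:function:restart-consumers",
                "arn:aws:states:us-east-1:123456789012:stateMachine:safe-remediation",
            ],
        },
        {   # Hard stop on destructive verbs, even if another policy allows them
            "Sid": "DenyDestructive",
            "Effect": "Deny",
            "Action": ["ec2:TerminateInstances", "rds:DeleteDBInstance",
                       "dynamodb:DeleteTable", "s3:DeleteBucket"],
            "Resource": "*",
        },
    ],
}
```

&lt;p&gt;The explicit Deny statement is the safety net: in IAM, a Deny always overrides an Allow, so even a misconfigured Allow elsewhere cannot grant the agent destructive verbs.&lt;/p&gt;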

&lt;p&gt;&lt;strong&gt;6. Have a KT plan for the agent. Your team member needs some babysitting.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Treat the DevOps agent as your new team member. It may be a superhero when it comes to AWS, but it’s still a novice when it comes to your specific cloud implementation.&lt;/p&gt;

&lt;p&gt;You need to train the agent with detailed information so it can develop a full understanding of your architecture, implementations, and even business context. Treat it like an expert Solution Architect who has just joined the team—don’t assume prior knowledge. Share everything you have and onboard it properly, rather than letting it jump straight into firefighting.&lt;/p&gt;

&lt;p&gt;Example: When you onboard a new solution architect, you provide proper knowledge transfer (KT), share architecture diagrams, explain why things exist, outline the business rationale, and discuss past failures. You need to do the same with your DevOps agent.&lt;/p&gt;

&lt;p&gt;Provide architecture diagrams, documentation, service mappings, business context, and known failure patterns. This enables the agent to prioritize a payment API over reporting jobs when managing alerts and to avoid repeating known bad remediations.&lt;/p&gt;

&lt;p&gt;Always remember: context reduces incorrect automation actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Let agents know what your developers are doing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, it’s a DevOps agent—but it still needs visibility into what your developers are working on. It’s essential to connect your CI/CD pipelines and provide this visibility to the agent.&lt;/p&gt;

&lt;p&gt;This allows the agent to correlate operational issues with recent code changes and deployments. As a result, the agent can identify specific commits or pipeline executions and isolate them to better understand the root cause of issues.&lt;/p&gt;

&lt;p&gt;Let’s be frank: most incidents today are code-related or deployment-related. The old saying still holds true—if you don’t touch it, it won’t break on its own. So when something isn’t working, let your agent answer the critical question: what changed?&lt;/p&gt;

&lt;p&gt;This significantly accelerates the agent’s ability to isolate root causes and reduce mean time to resolution (MTTR).&lt;/p&gt;

&lt;p&gt;Example: Let’s say there is a latency spike at a certain time. The DevOps agent checks the CI/CD pipeline and identifies that a deployment occurred shortly before the spike. The commit included changes to payment-related files.&lt;/p&gt;

&lt;p&gt;The agent then pulls additional metrics and correlates them with high confidence, concluding that the alert is caused by the recent deployment and recommending a rollback. Without this CI/CD context, the agent would waste time investigating infrastructure issues, increasing MTTR.&lt;/p&gt;
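&lt;p&gt;The "what changed?" correlation can be sketched roughly like this (the two-hour window, field names, and commit IDs are illustrative):&lt;/p&gt;

```python
from datetime import datetime, timedelta
from typing import Optional


def most_recent_deploy_before(spike_at: datetime,
                              deployments: list[dict],
                              window: timedelta = timedelta(hours=2)) -> Optional[dict]:
    """Return the latest deployment that landed within `window` before the
    latency spike -- the agent's first 'what changed?' suspect."""
    candidates = [d for d in deployments
                  if spike_at - window <= d["deployed_at"] <= spike_at]
    return max(candidates, key=lambda d: d["deployed_at"], default=None)
```

&lt;p&gt;A real agent would feed this from the CI/CD pipeline's deployment events and then drill into the suspect commit's changed files.&lt;/p&gt;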

&lt;p&gt;&lt;strong&gt;8. Hold your agent’s hand until it grows up. Start with a human in the loop and actively steer the work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Of course, initially you need to be heavily involved—you can’t realistically expect a fully autonomous agent from day one. You need to observe its behavior, explain context, and provide detailed recommendations that the agent can act on.&lt;/p&gt;

&lt;p&gt;Any remediation action should go through an approval process at the beginning. This is how the journey starts. Over time, you can gradually increase autonomy by putting the right guardrails in place. Remember, even a DevOps agent has to earn your trust.&lt;/p&gt;

&lt;p&gt;Steering the agent is equally important. It’s your environment, so you need to stay actively involved. Use chat features to provide details, discuss failure scenarios, and plan responses in real time. If you notice false alarms or incorrect root-cause analysis, correct the agent. Explain why you disagree so it can learn effectively.&lt;/p&gt;

&lt;p&gt;The idea is not to wait and see until the agent fails, but to proactively take action to ensure the agent succeeds.&lt;/p&gt;

&lt;p&gt;Example: The agent recommends restarting RDS, but a human rejects the action and explains that an RDS restart could cause data loss or customer impact during peak hours. The agent learns about time windows, business constraints, and safer alternatives.&lt;/p&gt;

&lt;p&gt;In later phases, the agent can automatically restart stateless services, while still requiring approval for any data-layer changes. Trust is built through guided autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Measure agent performance using business metrics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An agent is not a shiny object that you deploy and forget about. It’s actually useless if it doesn’t positively improve outcomes. That means you need to start measuring the right metrics.&lt;/p&gt;

&lt;p&gt;Track indicators such as Mean Time to Resolve (MTTR), noise reduction, the percentage of root causes identified automatically, and the percentage of remediations executed by the agent. These metrics help you understand whether the agent is delivering real value.&lt;/p&gt;

&lt;p&gt;Unless you measure performance and take the necessary actions based on those insights, there will be no meaningful improvement.&lt;/p&gt;

&lt;p&gt;Example: Before introducing the agent, MTTR was 45 minutes and 120 alerts were generated per service-impacting incident. After configuring the DevOps agent, MTTR dropped to 18 minutes, alert noise reduced to 35 alerts per incident, 40% of incidents were auto-diagnosed, and 20% were auto-remediated.&lt;/p&gt;

&lt;p&gt;These are the real business benefits you should strive to achieve. If you can’t demonstrate measurable impact, the agent is just a shiny demo.&lt;/p&gt;
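&lt;p&gt;The example figures above translate into these improvements (a trivial helper, just to make the arithmetic explicit):&lt;/p&gt;

```python
def pct_reduction(before: float, after: float) -> float:
    """Percent improvement for a lower-is-better metric."""
    return round(100.0 * (before - after) / before, 1)


# Figures from the example above
mttr_gain = pct_reduction(45, 18)     # MTTR: 45 min down to 18 min -> 60.0%
noise_gain = pct_reduction(120, 35)   # alerts per incident: 120 down to 35 -> 70.8%
```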

&lt;p&gt;&lt;strong&gt;10. Actively look into agent investigation gaps and work to resolve them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A DevOps agent will not be right on the first attempt, especially in the early stages. There will be many investigations it cannot continue due to implementation gaps, missing context, lack of telemetry data, missing capabilities, or permission issues.&lt;/p&gt;

&lt;p&gt;You need to regularly review these investigation gaps and provide the necessary inputs to the agent. Over time, this will enable the agent to become more effective and smarter in the long run.&lt;/p&gt;

&lt;p&gt;Example: The agent stops investigating and reports that it is unable to determine the root cause due to missing database query metrics. In response, you enable RDS Performance Insights, add slow query logs, and create a Lambda function to fetch query statistics.&lt;/p&gt;

&lt;p&gt;With this additional context, the agent is able to identify long-running queries and suggest actions such as index creation or query throttling.&lt;/p&gt;

&lt;p&gt;Every failure is a training data point for your agent, not a reason to abandon it or point fingers when it falls short.&lt;/p&gt;

&lt;p&gt;Finally, you need to continuously evolve with the AWS DevOps agent and take it on the journey.&lt;/p&gt;

&lt;p&gt;If you’re new to AWS DevOps and want to learn step by step, I’m creating a video series that does exactly that. You can check it out here:&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://www.youtube.com/embed/sjecp8x2BIY"&gt;&lt;/iframe&gt;&lt;/p&gt;

</description>
      <category>awsdevopsagent</category>
      <category>aws</category>
      <category>devops</category>
      <category>aiops</category>
    </item>
    <item>
      <title>Location-Based Flood Predictions with AI on AWS: Kalani River Case Study</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Tue, 16 Dec 2025 08:23:23 +0000</pubDate>
      <link>https://dev.to/aws-builders/location-based-flood-predictions-with-ai-on-aws-kalani-river-case-study-307n</link>
      <guid>https://dev.to/aws-builders/location-based-flood-predictions-with-ai-on-aws-kalani-river-case-study-307n</guid>
      <description>&lt;p&gt;To provide some background, I live in Sri Lanka, and just a few weeks ago we experienced one of the most severe flood events in decades.&lt;br&gt;
I live in Kaduwela, a town close to Colombo, where we face a risk of flooding even when there is little or no rainfall in our immediate area. This is mainly because one of Sri Lanka’s major rivers, the Kelani River, flows very close to us. When there is heavy rainfall in the upstream areas of the Kelani River, it naturally creates vulnerability for anyone living along the riverbanks downstream.&lt;/p&gt;

&lt;p&gt;Between 27th and 29th November, Cyclone Ditwah approached Sri Lanka and made landfall on 28th November 2025, unleashing extremely heavy rainfall. This caused rivers to overflow and resulted in widespread flooding across the country. Most upstream areas of the Kelani River received over 200 mm of rain within a 24-hour period, putting immense stress on the entire upstream reservoir and river system. As a result, water levels exceeded capacity and surged rapidly into downstream areas, triggering landslides and floods across the island.&lt;/p&gt;

&lt;p&gt;Living in Colombo, we began to feel the pressure around 29th November. Although there was already some level of flooding caused directly by Cyclone Ditwah, the situation worsened by the hour due to continuous heavy rainfall in the upstream regions of the Kelani River. It had already been predicted that Colombo could face its worst flooding in decades. Significant floods were previously reported in 1989 and 2016, and forecasts suggested that this event could be even more severe.&lt;br&gt;
We were fortunate to live in an area that was not directly flooded during either the 1989 or 2016 floods. However, our area is still highly vulnerable—roads get flooded quickly, exit routes become blocked, and travel can be completely cut off even without water entering homes.&lt;/p&gt;

&lt;p&gt;There were numerous weather forecasts being released, and naturally, during emergencies like this, we tend to glue ourselves in front of the TV, constantly watching 24-hour news coverage. At that point, I began to wonder whether there was a more scientific and data-driven way to assess the actual flood risk.&lt;/p&gt;

&lt;p&gt;The key questions I was trying to answer were:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;• When will my area experience its worst flooding?&lt;br&gt;
• When will the flood risk subside?&lt;br&gt;
• Will floodwaters reach my home?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I believe these are the three fundamental questions anyone living in a flood-risk zone tries to figure out during such events.&lt;/p&gt;

&lt;p&gt;In our case, the situation was slightly easier to analyze because there was very little rainfall in our immediate area. The primary risk was coming from upstream flooding. This narrowed the problem down to understanding when large volumes of water would travel from upstream to downstream through the Kelani River, and how that surge would impact our location.&lt;/p&gt;

&lt;p&gt;To identify a solution, the next step was to better understand the Kelani River itself. The image below provides a high-level view of the problem at hand, illustrating the course of the Kelani River.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjsla9cjafk36grt722z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjsla9cjafk36grt722z.png" alt="Kelani River" width="679" height="880"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Water flows through many areas, but the critical downstream path is:&lt;br&gt;
&lt;strong&gt;Upstream → Kitulgala → Glencourse → Hanwella → Kaduwela → Colombo&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All rainfall received in the upstream regions of the Kelani River eventually flows downstream through this path. Therefore, any significant increase in upstream rainfall directly impacts water levels in Hanwella, Kaduwela, and Colombo, making these areas particularly vulnerable during extreme weather events.&lt;/p&gt;

&lt;p&gt;The next step was to find data for these key areas. Gauge readings are taken regularly, and the data is published at &lt;a href="https://www.dmc.gov.lk/" rel="noopener noreferrer"&gt;https://www.dmc.gov.lk/&lt;/a&gt;.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Date &amp;amp; Time (25/11/2025)&lt;/th&gt;&lt;th&gt;Nagalagam Street (m)&lt;/th&gt;&lt;th&gt;Hanwella (m)&lt;/th&gt;&lt;th&gt;Glencourse (m)&lt;/th&gt;&lt;th&gt;Kithulgala (m)&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;18:30&lt;/td&gt;&lt;td&gt;2.20&lt;/td&gt;&lt;td&gt;2.38&lt;/td&gt;&lt;td&gt;10.30&lt;/td&gt;&lt;td&gt;1.78&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;21:30&lt;/td&gt;&lt;td&gt;1.60&lt;/td&gt;&lt;td&gt;2.26&lt;/td&gt;&lt;td&gt;10.21&lt;/td&gt;&lt;td&gt;1.89&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Nagalagam Street is the river gauging station located in Colombo.&lt;/p&gt;

&lt;p&gt;The way flooding unfolds along the Kelani River is relatively predictable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upstream reservoirs and river sections begin to fill first&lt;/li&gt;
&lt;li&gt;Water then flows downstream through Kitulgala&lt;/li&gt;
&lt;li&gt;By monitoring water levels from Kitulgala to Nagalagam Street, we can effectively observe the entire flood progression&lt;/li&gt;
&lt;li&gt;When water levels peak at Kitulgala, they subsequently recede there and then peak downstream at Glencourse, followed by Hanwella and finally Nagalagam Street&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This understanding was sufficient for me to build a quick GenAI application to estimate when flooding might impact Colombo and Kaduwela. I used AWS PartyRock to build this application.&lt;/p&gt;

&lt;p&gt;I designed a prompt for the app to analyze water levels along the Kelani River and estimate flood risk for Colombo and surrounding areas. Here’s a structured breakdown of each component:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Extract Latest Readings&lt;/strong&gt;&lt;br&gt;
Instruction:&lt;br&gt;
Extract the most recent water level readings for each station: Kithulgala, Glencourse, Hanwella, Nagalagam Street.&lt;br&gt;
Purpose:&lt;br&gt;
• Capture the current state of the river at multiple points.&lt;br&gt;
• Provides the starting point for all subsequent calculations and risk estimates.&lt;br&gt;
• Ensures analysis is based on real-time conditions, not historical averages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Compare Against 2016 Peaks&lt;/strong&gt;&lt;br&gt;
Instruction:&lt;br&gt;
Compare current water levels with 2016 flood peaks:&lt;br&gt;
• Nagalagam Street: 7.65 m&lt;br&gt;
• Hanwella: 10.51 m&lt;br&gt;
• Glencourse: 19.80 m&lt;br&gt;
• Kithulgala: N/A&lt;br&gt;
Purpose:&lt;br&gt;
• Provides a reference baseline for flood severity.&lt;br&gt;
• Highlights areas exceeding historic flood levels, which helps prioritize alerts and resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Calculate Flood Height&lt;/strong&gt;&lt;br&gt;
Instruction:&lt;br&gt;
Flood Height = Current Level − 2016 Peak&lt;br&gt;
Purpose:&lt;br&gt;
• Quantifies how much higher or lower current water levels are compared to the worst-known historical event.&lt;br&gt;
• Critical for understanding the magnitude of risk at each station.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Estimate Kaduwela Level Using Proxy&lt;/strong&gt;&lt;br&gt;
Instruction:&lt;br&gt;
Kaduwela Current ≈ Hanwella Current − 0.6–0.8 m&lt;br&gt;
Kaduwela 2016 Peak = 10.51 m (Hanwella proxy)&lt;br&gt;
Purpose:&lt;br&gt;
• Kaduwela doesn’t have direct measurements in real time.&lt;br&gt;
• Using hydrological proxies allows estimation of water levels based on upstream measurements.&lt;br&gt;
• Provides continuous flood-risk monitoring for a critical transition zone between middle-basin and downstream areas.&lt;/p&gt;
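&lt;p&gt;Steps 3 and 4 boil down to two one-line formulas. A sketch in Python, using the midpoint of the 0.6–0.8 m offset (my choice, not a calibrated value):&lt;/p&gt;

```python
def kaduwela_estimate(hanwella_current_m: float, offset_m: float = 0.7) -> float:
    """Proxy Kaduwela's level from Hanwella's reading, using the midpoint
    of the 0.6-0.8 m offset described in the prompt."""
    return round(hanwella_current_m - offset_m, 2)


def flood_height_vs_2016(current_m: float, peak_2016_m: float) -> float:
    """Flood Height = Current Level - 2016 Peak.
    Positive means the current level exceeds the 2016 peak."""
    return round(current_m - peak_2016_m, 2)
```

&lt;p&gt;With the 18:30 readings above, Hanwella at 2.38 m implies roughly 1.68 m at Kaduwela, well below the 10.51 m proxy peak.&lt;/p&gt;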

&lt;p&gt;&lt;strong&gt;5. Hydrological Principle: Transition Zone Dynamics&lt;/strong&gt;&lt;br&gt;
Instruction:&lt;br&gt;
• Kaduwela sits between Hanwella and Nagalagam Street.&lt;br&gt;
• Water reaches Kaduwela earlier than Nagalagam Street and recedes earlier.&lt;br&gt;
• Estimate timing differences:&lt;br&gt;
◦ Kaduwela resolves 6–12 hours before Nagalagam Street&lt;br&gt;
◦ If Nagalagam = 24–36 hours, Kaduwela = 12–18 hours&lt;br&gt;
Purpose:&lt;br&gt;
• Explains flood propagation along the river.&lt;br&gt;
• Ensures the model predicts not only peak levels but also timing.&lt;br&gt;
• Helps residents and authorities prepare for upstream vs downstream risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Account for Hydrological and Geographical Factors&lt;/strong&gt;&lt;br&gt;
Instruction:&lt;br&gt;
Account for inter-station distances, river gradient, catchment size, flow velocity, elevation changes, and downstream lag time to produce more accurate flood-level estimates and timing predictions.&lt;br&gt;
Purpose:&lt;br&gt;
• Adds real-world context to the calculations.&lt;br&gt;
• Recognizes that water doesn’t flow instantaneously: topography, distance, and river dynamics affect flood timing and severity.&lt;br&gt;
• Improves accuracy of flood predictions across multiple stations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Generate Markdown Table&lt;/strong&gt;&lt;br&gt;
Instruction:&lt;br&gt;
• Include all five locations in downstream order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Kithulgala (upstream)&lt;/li&gt;
&lt;li&gt; Glencourse (upper middle)&lt;/li&gt;
&lt;li&gt; Hanwella (middle)&lt;/li&gt;
&lt;li&gt; Kaduwela (transition zone)&lt;/li&gt;
&lt;li&gt; Nagalagam Street / Colombo (downstream)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;• Columns include:&lt;br&gt;
◦ Location&lt;br&gt;
◦ 2016 Max Level&lt;br&gt;
◦ Current Level&lt;br&gt;
◦ Flood Height vs 2016&lt;br&gt;
◦ Status Now (🟢/🟡/🔴)&lt;br&gt;
◦ Trend (🟢/🟡/🔴)&lt;br&gt;
◦ Peak Status (🟢/🟡/🔴)&lt;br&gt;
◦ New Flood Risk&lt;br&gt;
◦ Notes&lt;/p&gt;

&lt;p&gt;Purpose:&lt;br&gt;
• Provides a clear visual summary for decision-makers.&lt;br&gt;
• Uses color-coded indicators for immediate understanding of risk and trend.&lt;br&gt;
• Ensures consistency in reporting across all stations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Status Indicators Explained&lt;/strong&gt;&lt;br&gt;
Instruction:&lt;br&gt;
• Status Now: 🟢 Low | 🟡 Moderate | 🔴 High/Critical&lt;br&gt;
• Trend: 🟢 Falling fast | 🟡 Slowly falling | 🔴 Rising&lt;br&gt;
• Peak Status: 🟢 Passed | 🟡 At peak | 🔴 Still coming&lt;br&gt;
• New Flood Risk: Describes residual risks (secondary hazards, recurrence, duration)&lt;br&gt;
Purpose:&lt;br&gt;
• Translates numeric data into human-readable risk levels.&lt;br&gt;
• Helps residents and authorities quickly identify which areas need attention now.&lt;br&gt;
• Incorporates residual risk after peak passes.&lt;/p&gt;
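&lt;p&gt;The "Status Now" indicator is just a threshold mapping. A sketch, where the warning/critical thresholds are station-specific inputs the operator would supply, not values from this analysis:&lt;/p&gt;

```python
def status_now(current_m: float, warning_m: float, critical_m: float) -> str:
    """Translate a gauge reading into the prompt's traffic-light status.
    warning_m and critical_m are per-station thresholds chosen by the
    operator; they are illustrative here."""
    if current_m >= critical_m:
        return "🔴"
    if current_m >= warning_m:
        return "🟡"
    return "🟢"
```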

&lt;p&gt;&lt;strong&gt;9. Example Rows&lt;/strong&gt;&lt;br&gt;
• Glencourse: Peak passed, still elevated, water moving downstream&lt;br&gt;
• Kaduwela: At or near peak, transition zone, clears before Colombo&lt;br&gt;
• Nagalagam Street: Still critical, longest drainage time&lt;br&gt;
Purpose:&lt;br&gt;
• Shows how to interpret table data for decision-making.&lt;br&gt;
• Demonstrates flow progression and lag effects along the river.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Summary Paragraph&lt;/strong&gt;&lt;br&gt;
Instruction:&lt;br&gt;
• Explain upstream recession, Hanwella’s peak, Kaduwela as transition zone, Nagalagam Street as longest residual risk&lt;br&gt;
• Highlight timing cascade, residual hazards, recurrence vulnerability, infrastructure exposure, contaminated waters&lt;br&gt;
• Provide safety recommendations prioritizing areas with longest drainage times&lt;br&gt;
Purpose:&lt;br&gt;
• Converts table data into a narrative that is actionable.&lt;br&gt;
• Provides context for emergency response and public awareness.&lt;br&gt;
• Completes the flood analysis workflow from data → calculation → visualization → actionable insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnp4f6h38gprnha3w3fs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnp4f6h38gprnha3w3fs.png" alt="Flood Prediction" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Details About the AI Model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I used Claude 3.5 Sonnet V2 for this project because of its strong reasoning capabilities and structured output formatting, which made it well-suited for analyzing hydrological data and generating clear, actionable tables.&lt;/li&gt;
&lt;li&gt;I deliberately disabled internet access for the model, as I was already supplying all the relevant, real-time water-level data. This ensured that the analysis relied solely on the data I provided, avoiding inconsistencies or external noise.&lt;/li&gt;
&lt;li&gt;I set the temperature to 0 to encourage focused, deterministic responses. This reduced variability and ensured that the output was predictable, consistent, and easy to interpret, which is critical when analyzing flood risk and producing actionable tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach allowed me to gain a clear understanding of peak water levels at each location, including when the peak would occur at Kaduwela and when the flood risk would subside. It provided a sense of control and reassurance during an otherwise uncertain situation.&lt;br&gt;
The next step was to estimate the actual flood risk for my location. While precise predictions are inherently difficult, I developed a practical workaround. Since the floodwaters were nearby, I could pinpoint exact locations of two key points, their water levels, and combine that with the precise location and elevation of my house. Using this data, I had the model generate flood projections for my property. It’s not a perfect solution, but it was a feasible and useful approach given the circumstances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personalized Risk Assessment: Provides flood projections specific to your house/location, rather than general area-wide warnings.&lt;/li&gt;
&lt;li&gt;Early Awareness: Helps anticipate peak water levels and timing, giving time to prepare and take preventive measures.&lt;/li&gt;
&lt;li&gt;Data-Driven Comfort: Using actual upstream measurements combined with your location offers a sense of control and situational awareness during uncertain flood events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited Accuracy: The approach depends on proxy data and approximations, so predictions may not perfectly reflect real conditions.&lt;/li&gt;
&lt;li&gt;Point-Specific: Works well for specific locations, but cannot provide a comprehensive view for wider areas or multiple properties.&lt;/li&gt;
&lt;li&gt;Model Limitations: The AI model may miss sudden changes in rainfall or upstream surges, as it relies only on the provided data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How I Used AI to Assess Flood Risk at My Location&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To understand the flood risk for my house during the recent Kelani River floods, I created a location-based AI analysis. I approached it as if I were a hydrological flood risk analyst, providing the AI with reference points along the river and my home’s location. Here’s how the process worked:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Points and User Location&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I provided the AI with two reference points along the river and my house location:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Point 1&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Coordinates: [Reference Point 1 - Coordinates]&lt;/p&gt;

&lt;p&gt;Current Flood Level: [Reference Point 1 - Flood Level]&lt;/p&gt;

&lt;p&gt;Elevation: [Reference Point 1 - Elevation]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Point 2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Coordinates: [Reference Point 2 - Coordinates]&lt;/p&gt;

&lt;p&gt;Current Flood Level: [Reference Point 2 - Flood Level]&lt;/p&gt;

&lt;p&gt;Elevation: [Reference Point 2 - Elevation]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User’s Location (My House)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Coordinates: [Your Location - Coordinates]&lt;/p&gt;

&lt;p&gt;Elevation: [Your Location - Elevation]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Location Proximity Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI calculated which reference point is closer to my house and estimated distances in meters. This helps understand which upstream or downstream readings are more relevant for my flood risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Elevation and Topography&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using elevation data, the AI analyzed how terrain affects flood propagation.&lt;/p&gt;

&lt;p&gt;Higher elevations naturally have lower flood risk.&lt;/p&gt;

&lt;p&gt;Lower elevations or downhill positions are more vulnerable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Flood Level Interpolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI estimated my house’s likely flood level based on:&lt;/p&gt;

&lt;p&gt;Linear interpolation between the two reference points&lt;/p&gt;

&lt;p&gt;Elevation differences (water flows downhill)&lt;/p&gt;

&lt;p&gt;Upstream vs downstream position in the river basin&lt;/p&gt;

&lt;p&gt;Whether my location is uphill or downhill from the reference points&lt;/p&gt;

&lt;p&gt;This gave a custom flood level prediction for my specific location.&lt;/p&gt;
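&lt;p&gt;Step 3's interpolation and Step 7's depth calculation can be sketched as follows. This is a simplification that weights only by distance; the AI also factored in elevation differences and upstream/downstream position, and all numbers used here are placeholders:&lt;/p&gt;

```python
def interpolate_flood_level(dist1_m: float, level1_m: float,
                            dist2_m: float, level2_m: float) -> float:
    """Distance-weighted linear interpolation between the two reference
    points; the nearer point gets the larger weight."""
    total = dist1_m + dist2_m
    return (dist2_m / total) * level1_m + (dist1_m / total) * level2_m


def water_depth_at_house(flood_level_m: float, house_elevation_m: float) -> float:
    """Step 7: expected water depth; zero means the projected flood level
    stays below the house's elevation."""
    return max(0.0, flood_level_m - house_elevation_m)
```

&lt;p&gt;For example, a house 100 m from a 6.0 m reference point and 300 m from an 8.0 m one gets an interpolated level of 6.5 m; against a 5.0 m house elevation, that implies about 1.5 m of water.&lt;/p&gt;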

&lt;p&gt;&lt;strong&gt;Step 4: Risk Assessment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using color-coded indicators, the AI assessed the flood risk:&lt;/p&gt;

&lt;p&gt;🟢 GREEN – Low risk, safe, flood levels below dangerous thresholds&lt;/p&gt;

&lt;p&gt;🟡 AMBER – Moderate risk, approaching warning levels, caution advised&lt;/p&gt;

&lt;p&gt;🔴 RED – High risk, at or above thresholds, immediate concern&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Estimated Flood Level&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI provided an interpolated flood level estimate for my house based on the reference points, factoring in both water level and elevation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Comparison with Reference Points&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It explained how my location compares to the reference points in terms of elevation and expected flooding, helping me understand whether I was upstream, downstream, or in a critical transition zone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Water Depth Calculation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By subtracting my house’s elevation from the projected flood level, the AI calculated the expected water depth at my location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Recommendations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI provided actionable advice based on the predicted flood risk:&lt;/p&gt;

&lt;p&gt;Evacuation timing&lt;/p&gt;

&lt;p&gt;Preparing flood barriers&lt;/p&gt;

&lt;p&gt;Monitoring upstream changes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 9: Timing Estimates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, the AI estimated when the flood risk would peak at my house and when it would recede, based on trends observed between the two reference points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyrd4dizc0xtp8nzxf2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyrd4dizc0xtp8nzxf2f.png" alt="Flood Location" width="642" height="757"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Approach Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This location-based analysis allows homeowners to:&lt;/p&gt;

&lt;p&gt;Understand their specific flood risk, not just general area warnings&lt;/p&gt;

&lt;p&gt;See expected water levels and timing&lt;/p&gt;

&lt;p&gt;Make informed decisions about safety and preparation&lt;/p&gt;

&lt;p&gt;By combining reference points, elevation data, and interpolation, this method provides a practical, data-driven solution for assessing flood risk at any individual location along a river.&lt;/p&gt;

&lt;p&gt;Test the App - &lt;a href="https://partyrock.aws/u/sre/qfVrummDF/Location-Based-Flood-Predictions" rel="noopener noreferrer"&gt;https://partyrock.aws/u/sre/qfVrummDF/Location-Based-Flood-Predictions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, this approach is not perfect, but it gave me something constructive to focus on during a very stressful period. It allowed me to feel like I had some control over what was happening—at least, that’s how I like to think about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next Steps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This solution could be improved into a full system where anyone can provide their Google Maps location, and the system predicts their flood risk automatically.&lt;/li&gt;
&lt;li&gt;We could incorporate additional data points and more sophisticated scientific formulas to make the predictions more robust and accurate.&lt;/li&gt;
&lt;li&gt;The AI model could be integrated with advanced forecasting capabilities, including rainfall projections and upstream river data, for real-time monitoring.&lt;/li&gt;
&lt;li&gt;If anyone is interested in taking this project to the next level, feel free to send me a message on LinkedIn.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>floodprediction</category>
      <category>partyrock</category>
    </item>
    <item>
      <title>AWS Outage Exposed Your SaaS Stack — Here’s How to Make It Resilient</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Tue, 21 Oct 2025 12:03:53 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-outage-exposed-your-saas-stack-heres-how-to-make-it-resilient-40gj</link>
      <guid>https://dev.to/aws-builders/aws-outage-exposed-your-saas-stack-heres-how-to-make-it-resilient-40gj</guid>
      <description>&lt;p&gt;It is now well documented that the us-east-1 region experienced a significant outage on AWS on October 20th, 2025. While there is already much discussion around why such a vast number of systems were impacted and what design weaknesses were exposed, for me, the real story isn’t just that AWS went down (of course, us-east-1), but rather how many SaaS providers went down with it.&lt;/p&gt;

&lt;p&gt;There is a growing push toward adopting SaaS platforms due to their obvious advantages — abstracting away infrastructure management and letting teams focus on solving business problems that matter. However, while SaaS is beneficial, it hides many resiliency weaknesses — until you get the shock of your life during a major cloud outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Closer Look: Example E-commerce Architecture Affected&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s take one example: if you're running a large e-commerce platform, your architecture might rely on the following stack — and here's how each layer was affected.&lt;/p&gt;

&lt;p&gt;Note: The SaaS dependency and impact details in this article are based on publicly available information, incident reports, and observed behavior during the AWS us-east-1 outage. Some examples are illustrative or inferential in nature and may not reflect the full internal architecture of each provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend Hosting — Vercel&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vercel, a popular platform for Next.js applications, was reportedly impacted during the outage, likely due to its reliance on AWS infrastructure such as Lambda (for serverless functions), EC2 (for compute), and DynamoDB (for metadata storage).&lt;br&gt;
During the outage, users experienced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failed deployments&lt;/li&gt;
&lt;li&gt;Elevated error rates in serverless functions&lt;/li&gt;
&lt;li&gt;CDN rerouting issues&lt;/li&gt;
&lt;li&gt;Intermittent dashboard access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While Vercel's architecture spans multiple regions, users whose deployments were primarily in us-east-1 faced notable downtime, with some sites and APIs going offline temporarily.&lt;/p&gt;

&lt;p&gt;Vercel CEO Guillermo Rauch acknowledged the issue on X:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzbtuik2iyx90cbl0pk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzbtuik2iyx90cbl0pk2.png" alt="Vercel CEO Guillermo Rauch acknowledged the issue on X" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity Management — Auth0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Auth0, an Okta company, is widely assumed to rely heavily on AWS infrastructure, which may have contributed to service disruptions during the us-east-1 outage. For customers in that region, failover mechanisms such as Geo-HA may have been triggered, though public information on their effectiveness is limited.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh5qq25v6p6ty6t3bkua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh5qq25v6p6ty6t3bkua.png" alt="Auth0 Impact" width="780" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability — Datadog&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datadog was likely affected to some extent during the AWS us-east-1 outage, given its integration with AWS services such as DynamoDB, EC2, and Lambda for telemetry ingestion (metrics, logs, traces).&lt;/p&gt;

&lt;p&gt;Possible effects for users included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delayed data processing&lt;/li&gt;
&lt;li&gt;Gaps in historical logs&lt;/li&gt;
&lt;li&gt;Reduced visibility into workloads running on AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Datadog operates on a multi-cloud architecture (AWS, GCP, Azure), so the platform did not experience complete downtime. Nevertheless, users relying on AWS-specific integrations may have seen temporary cascading issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z2ipgkvxuhb0f9n1k99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z2ipgkvxuhb0f9n1k99.png" alt="Datadog Impact" width="800" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Payments — Stripe&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stripe may have experienced some service disruptions during the AWS us-east-1 outage. Much of Stripe’s infrastructure runs on AWS (EC2 for compute, S3 for storage), which could have contributed to temporary issues.&lt;/p&gt;

&lt;p&gt;Possible effects reported by users included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elevated API error rates&lt;/li&gt;
&lt;li&gt;Dashboard access issues&lt;/li&gt;
&lt;li&gt;Payment processing delays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While Stripe did not experience a full outage, dependencies on AWS services may have led to cascading issues affecting certain workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication &amp;amp; Collaboration — Slack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Slack reportedly experienced some service disruptions during the AWS us-east-1 outage, possibly due to dependencies on AWS services such as EC2, S3, and Lambda.&lt;/p&gt;

&lt;p&gt;Users may have noticed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failed message deliveries&lt;/li&gt;
&lt;li&gt;Delayed notifications&lt;/li&gt;
&lt;li&gt;Intermittent workspace loading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0yfhw99h0xqck9vvyke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0yfhw99h0xqck9vvyke.png" alt="Slack Impact" width="800" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are just a few examples. The list goes on — and it reveals a critical point: SaaS platforms promise scalability, ease of use, and low maintenance, but their black-box nature hides several resiliency vulnerabilities — which the AWS outage brought into the spotlight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Went Wrong: Key Issues Exposed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cascading Failures from Shared Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many SaaS providers run on AWS and default to the us-east-1 region due to its maturity and low latency.&lt;br&gt;
But when it fails, it creates ripples across countless services, often in unexpected ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SaaS ≠ Always-On&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without transparency into a provider’s infrastructure, you can’t audit failover paths or validate high availability claims.&lt;br&gt;
This creates a domino effect, where one outage stalls your entire workflow, and you're left completely blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even Giants Weren’t Immune&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some large streaming platforms may experience disruptions during regional cloud outages. For example, high-traffic services like Disney+ Hotstar could be affected by dependencies on cloud infrastructure such as AWS EC2 or S3, though no confirmed reports are available for this specific outage.&lt;/p&gt;

&lt;p&gt;The reality is that most of the issues discussed above are beyond our direct control. SaaS providers abstract away their backend infrastructure, which can leave you vulnerable to upstream failures. However, there are proactive steps you can take within your control to mitigate these risks and improve system resilience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SaaS Resilience Improvement Plan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Map and document SaaS dependencies&lt;/strong&gt;&lt;br&gt;
Create and maintain an up-to-date inventory of all SaaS services your system relies on, both directly and indirectly. Include details such as the underlying cloud infrastructure (e.g., AWS, GCP), regional hosting, and the criticality of each service to your operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Implement client-side circuit breakers and retries&lt;/strong&gt;&lt;br&gt;
Add fault-tolerance mechanisms in your frontend and backend code, such as circuit breakers, timeouts, exponential backoff retries, and fallback UIs. This ensures that transient SaaS outages do not fully break your user experience.&lt;/p&gt;
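&lt;p&gt;As a minimal sketch of the idea (the class name and thresholds below are illustrative, not tied to any particular library), a client-side circuit breaker can be as simple as counting consecutive failures and failing fast for a cool-down period:&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Minimal client-side circuit breaker: after max_failures
    consecutive errors, calls fail fast until reset_after seconds pass."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While the breaker is open, skip the upstream call entirely.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at >= self.reset_after:
                self.opened_at = None  # half-open: allow a trial call
                self.failures = 0
            else:
                return fallback()
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result
```

&lt;p&gt;Wrapping each third-party call this way means a SaaS outage degrades the user experience instead of breaking it outright.&lt;/p&gt;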

&lt;p&gt;&lt;strong&gt;3. Cache critical data locally&lt;/strong&gt;&lt;br&gt;
For high-availability features (e.g., product catalog, user settings), implement edge or client-side caching strategies. This allows your system to serve stale-but-usable data if upstream SaaS services are temporarily unavailable.&lt;/p&gt;
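&lt;p&gt;A stale-but-usable cache can be sketched like this (a deliberately simplified, hypothetical helper; production systems would typically lean on a CDN or edge cache instead):&lt;/p&gt;

```python
import time

class StaleCache:
    """Serve-stale cache: returns fresh data while the upstream call
    succeeds, and falls back to the last known value when it fails."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.store = {}  # key -> (value, fetched_at)

    def get(self, key, fetch):
        entry = self.store.get(key)
        fresh = entry is not None and self.ttl >= time.monotonic() - entry[1]
        if fresh:
            return entry[0]
        try:
            value = fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # stale-but-usable during an outage
            raise                # nothing cached to fall back on
        self.store[key] = (value, time.monotonic())
        return value
```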

&lt;p&gt;&lt;strong&gt;4. Set up independent monitoring and alerting&lt;/strong&gt;&lt;br&gt;
Do not rely solely on the provider’s status pages. Implement external health checks and synthetic monitoring to independently track the availability and performance of critical third-party services.&lt;/p&gt;
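&lt;p&gt;The core of a synthetic check is small. In this sketch (all names are illustrative) each dependency is probed via an injected &lt;code&gt;probe&lt;/code&gt; function, which in production would be an HTTP request with a short timeout:&lt;/p&gt;

```python
def check_dependencies(endpoints, probe, slow_threshold=2.0):
    """Probe each dependency independently instead of trusting its
    status page. Returns a name -> "ok" / "degraded" / "down" report."""
    report = {}
    for name, url in endpoints.items():
        try:
            latency = probe(url)  # seconds; raises on failure
            report[name] = "degraded" if latency > slow_threshold else "ok"
        except Exception:
            report[name] = "down"
    return report
```

&lt;p&gt;Feeding this report into your own alerting gives you outage signals that do not depend on the provider admitting there is a problem.&lt;/p&gt;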

&lt;p&gt;&lt;strong&gt;5. Enable redundant SaaS providers (where feasible)&lt;/strong&gt;&lt;br&gt;
For high-risk areas such as authentication, payments, or observability, consider integrating with secondary SaaS providers that can be switched to manually or programmatically during outages. Be mindful that this can increase complexity and may require handling differences between providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Configure multi-region deployment for services under your control&lt;/strong&gt;&lt;br&gt;
Where you manage infrastructure or use PaaS providers (e.g., Vercel, Firebase), ensure that deployments span multiple regions. Avoid over-reliance on a single cloud region, such as AWS us-east-1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Use event-driven buffering for critical workflows&lt;/strong&gt;&lt;br&gt;
Decouple workflows using queues or message buffers (e.g., SQS, Kafka, Durable Queues) so that temporary upstream failures do not result in data loss or dropped transactions.&lt;/p&gt;
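&lt;p&gt;The pattern can be sketched with Python's standard &lt;code&gt;queue&lt;/code&gt; module (the retry limit and dead-letter handling are illustrative):&lt;/p&gt;

```python
import queue

def drain(q, deliver, max_attempts=5):
    """Buffered dispatch: items are enqueued first, so a transient
    upstream failure delays delivery instead of dropping the transaction."""
    delivered, dead_letter = [], []
    while not q.empty():
        item = q.get()
        attempts = item.setdefault("attempts", 0)
        try:
            deliver(item)
            delivered.append(item)
        except Exception:
            item["attempts"] = attempts + 1
            if item["attempts"] >= max_attempts:
                dead_letter.append(item)  # park for manual review
            else:
                q.put(item)               # re-enqueue for a later retry
    return delivered, dead_letter
```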

&lt;p&gt;&lt;strong&gt;8. Test system resilience with chaos engineering&lt;/strong&gt;&lt;br&gt;
Regularly simulate SaaS outages (e.g., temporarily disabling a key API) to test how your system behaves under failure conditions and identify points of fragility before a real outage occurs.&lt;/p&gt;
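&lt;p&gt;Even without a full chaos-engineering platform, a simple fault-injection wrapper (an illustrative sketch, not a named tool) lets you verify that your fallbacks actually engage:&lt;/p&gt;

```python
import random

def chaos(fn, failure_rate=0.3, rng=random.random):
    """Wrap a SaaS client call so a fraction of requests fail on
    purpose, exposing code paths that assume the dependency is up."""
    def wrapped(*args, **kwargs):
        if failure_rate > rng():
            raise ConnectionError("injected outage")
        return fn(*args, **kwargs)
    return wrapped
```

&lt;p&gt;Running your test suite against a &lt;code&gt;chaos&lt;/code&gt;-wrapped client is a cheap way to find fragility before a real outage does.&lt;/p&gt;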

&lt;p&gt;&lt;strong&gt;9. Establish offline-friendly workloads&lt;/strong&gt;&lt;br&gt;
Where possible, allow users to continue working in a limited or offline mode—especially in mobile apps or agent consoles—and sync data back once the upstream SaaS service recovers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Monitor and enforce SaaS SLAs&lt;/strong&gt;&lt;br&gt;
Track uptime, latency, and incident response of critical SaaS providers. Ensure they meet their SLA commitments, and escalate contractually or operationally if violations become frequent.&lt;/p&gt;

&lt;p&gt;These strategies will not eliminate risk entirely, and that’s okay. But they can significantly reduce exposure so that when the unexpected happens, you’re not scrambling—you’re calmly sipping a cup of tea.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>saas</category>
    </item>
    <item>
      <title>I Built an AI Agent That Reveals Wall Street Sentiment in Seconds</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Sun, 31 Aug 2025 21:08:32 +0000</pubDate>
      <link>https://dev.to/indika_wimalasuriya/i-built-an-ai-agent-that-reveals-wall-street-sentiment-in-seconds-4ma2</link>
      <guid>https://dev.to/indika_wimalasuriya/i-built-an-ai-agent-that-reveals-wall-street-sentiment-in-seconds-4ma2</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/brightdata-n8n-2025-08-13"&gt;AI Agents Challenge powered by n8n and Bright Data&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What I Built&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I built an AI agent that aggregates and analyzes sentiment from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Online sources surfaced via Google AI Search&lt;/li&gt;
&lt;li&gt;Major US stock market news sites&lt;/li&gt;
&lt;li&gt;X.com user posts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It delivers a consolidated Wall Street sentiment report in seconds. The agent scans trending trader discussions, financial headlines, and online chatter, then generates a concise, actionable summary—sent directly via email—so investors can understand market sentiment instantly without spending hours researching.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;h3&gt;
  
  
  n8n Workflow
&lt;/h3&gt;

&lt;p&gt;GitHub - &lt;a href="https://github.com/wimalasuriyaib/WallStreetSentimentAnalyzer" rel="noopener noreferrer"&gt;https://github.com/wimalasuriyaib/WallStreetSentimentAnalyzer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wall Street Sentiment Analyzer - This is my main Agent workflow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qenq70zm2z31w4vbj6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qenq70zm2z31w4vbj6b.png" alt="Wall Street Sentiment Analyzer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online Stock Market Sentiment Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6io9e7sw3kvuhgr2d038.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6io9e7sw3kvuhgr2d038.png" alt="Online Stock Market Sentiment Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;X.com Stock Market Sentiment Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uss1gblq47ls2pm5a8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uss1gblq47ls2pm5a8q.png" alt="IX.com Stock Market Sentiment Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhe804mytyi1wx62ifb1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhe804mytyi1wx62ifb1h.png" alt="Agent Capabilities"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final output - US Stock Market Sentiment Report&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudl0b6bdgps1oyfvdg35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudl0b6bdgps1oyfvdg35.png" alt="US Stock Market Sentiment Report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch the demo video for a walkthrough of the agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/9TgFsQK9tck"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Technical Implementation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The workflow is designed to automatically fetch real-time US stock market sentiment, process the data, and generate a concise 200-word summary suitable for a blog post. It leverages BrightData for data extraction, Google AI for querying, and Google Gemini (PaLM API) for text summarization. The workflow is fully automated and orchestrated using n8n, an open-source workflow automation tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Instructions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Triggering:&lt;/strong&gt; Manual execution via the Manual Trigger node or can be scheduled with a Cron node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Collection:&lt;/strong&gt; BrightData Web Scraper Node: Sends a query to the BrightData dataset API to extract market sentiment from a specified URL using a predefined prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snapshot Monitoring:&lt;/strong&gt; The workflow waits for the BrightData snapshot to be ready and monitors its progress using the Check Snapshot Status node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Handling:&lt;/strong&gt; Once the snapshot is ready, the Download Snapshot Content node retrieves the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edit Fields Node:&lt;/strong&gt; Normalizes the output JSON and extracts the relevant answer_text for further processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Summarization:&lt;/strong&gt; Google Gemini Node: Passes the extracted text to the Gemini model (models/gemini-2.0-flash) to generate a concise 200-word summary suitable for a blog post.&lt;/p&gt;

&lt;p&gt;Prompts are dynamically injected from the snapshot content for contextual summarization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Choice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Google Gemini (PaLM API) – Selected for its ability to generate human-like, high-quality text summaries and handle complex financial language and sentiment analysis effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory / Data Handling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Workflow uses pinData to store intermediate data (answer_text) securely within n8n.&lt;/p&gt;

&lt;p&gt;Each node is stateless, relying on BrightData snapshots to maintain consistency and reproducibility.&lt;/p&gt;

&lt;p&gt;The workflow handles errors via conditional checks (IF nodes) that keep waiting and retrying the snapshot download until the data is ready.&lt;/p&gt;
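&lt;p&gt;Conceptually, that IF-node retry loop behaves like this polling sketch (the status strings and the &lt;code&gt;fetch_status&lt;/code&gt; callable are stand-ins for the Check Snapshot Status node, not the exact BrightData API):&lt;/p&gt;

```python
import time

def wait_for_snapshot(snapshot_id, fetch_status, max_polls=20,
                      delay=1.0, sleep=time.sleep):
    """Poll an asynchronous snapshot until it reports ready, giving up
    after max_polls attempts. fetch_status(snapshot_id) returns a status
    string such as "running" or "ready" (illustrative values)."""
    for _ in range(max_polls):
        if fetch_status(snapshot_id) == "ready":
            return True
        sleep(delay)  # back off between polls instead of hammering the API
    return False
```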

&lt;p&gt;&lt;strong&gt;Tools Used&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;n8n:&lt;/strong&gt; Orchestrates the workflow, manages triggers, and passes data between nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BrightData:&lt;/strong&gt; Handles data extraction from dynamic websites using snapshots. Provides monitoring APIs to ensure completeness and accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Gemini (PaLM API):&lt;/strong&gt; Processes raw sentiment text. Produces coherent, concise summaries ready for blog publishing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asynchronous snapshot handling: Ensures the workflow doesn’t fail if data isn’t ready immediately.&lt;/li&gt;
&lt;li&gt;Dynamic prompt injection: Allows custom queries without modifying the workflow logic.&lt;/li&gt;
&lt;li&gt;Seamless integration: BrightData and Google Gemini nodes are fully credentialed and reusable for multiple datasets or sentiment sources.&lt;/li&gt;
&lt;li&gt;Scalable design: Can be extended to multiple stock tickers, social media sentiment, or regional markets by adjusting the query parameters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Future Enhancements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrate automated blog publishing via WordPress or Medium APIs.&lt;/li&gt;
&lt;li&gt;Add historical sentiment tracking and trend analysis.&lt;/li&gt;
&lt;li&gt;Incorporate alerts or notifications if sentiment changes drastically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Bright Data Verified Node&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Bright Data Verified Node is a critical component in our stock market sentiment workflow, providing reliable and scalable web data extraction without the typical challenges of web scraping. By leveraging Bright Data, we can trigger dataset snapshots, monitor their progress in real-time, and download structured results automatically. This eliminates the need for building custom scraping pipelines, handling IP rotation, or managing proxy networks—tasks that are notoriously error-prone and time-consuming.&lt;/p&gt;

&lt;p&gt;Using Bright Data ensures high data accuracy and compliance, which is particularly important when accessing dynamic and frequently updated sources like Google AI search results. Without it, we would face the complexity of dealing with anti-bot mechanisms, frequent source changes, and the overhead of continuously maintaining scraping scripts. Such manual approaches often result in inconsistent data, higher failure rates, and significant delays in processing, all of which could compromise the quality of downstream AI analysis.&lt;/p&gt;

&lt;p&gt;By integrating the Verified Node, our solution gains reliability, speed, and maintainability. The node abstracts away the operational burdens of web data extraction, allowing us to focus on extracting insights, summarizing market sentiment with AI, and generating actionable content. Bright Data, therefore, transforms what could be a fragile, labor-intensive process into a seamless, scalable workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Journey&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Participating in this hackathon was an incredibly exciting experience, as it was my first time using both n8n and Bright Data. I began by spending several hours watching n8n videos. Since n8n is a low-code solution, my initial approach was to jump straight into building—but I quickly realized I lacked the basics and failed miserably after a few hours. While it is designed to accelerate development, mastering the fundamentals is essential.&lt;/p&gt;

&lt;p&gt;I then went through the n8n Beginner Course on YouTube at high speed and complemented it by using freely available templates to build small projects for hands-on practice. I took a similar approach with Bright Data, experimenting with small projects to get comfortable with its capabilities.&lt;/p&gt;

&lt;p&gt;Once I felt confident with both tools, I defined my problem statement: capture Wall Street sentiment analysis in seconds. Developing this stock market sentiment workflow was both challenging and rewarding. The initial goal was to capture real-time investor sentiment reliably and convert it into actionable AI-driven insights. A major hurdle was handling dynamic web content, especially from sources like Google AI Search, which frequently change and block automated requests. Without a robust solution, scraping would have been slow, error-prone, and difficult to maintain.&lt;/p&gt;

&lt;p&gt;Integrating Bright Data’s Verified Node was a game-changer. It provided a secure, compliant, and scalable way to trigger dataset snapshots, monitor progress, and retrieve structured results effortlessly. This eliminated the need to manually manage proxies, IP rotations, and anti-bot measures.&lt;/p&gt;

&lt;p&gt;Processing large amounts of unstructured text data was another challenge. Leveraging Google Gemini (PaLM) for summarization enabled us to convert raw responses into concise, high-quality 200-word blog posts. Combining Bright Data’s reliability with AI-powered summarization streamlined the workflow and significantly reduced operational complexity.&lt;/p&gt;

&lt;p&gt;Since Bright Data can’t be directly added as an agent tool, I created two separate workflows: one to gather sentiment from online content and another from users on x.com. It took me some time to figure this out, but once implemented, completing the project became much faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hackception: Mini Hackathon Inside the Hackathon&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdctlvwsybq1dlurgoue.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdctlvwsybq1dlurgoue.jpg" alt="ini Hackathon Inside the Hackathon"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A major highlight of this hackathon was involving students from the University of Peradeniya. During the Code to Cloud training program, I decided to run a mini hackathon within the main hackathon, introducing students to n8n and Bright Data. We launched it on Friday, with just two days left before the deadline, conducted walkthroughs of sample projects, and then let students develop their own workflows. So far, one student has submitted their project, and I expect more submissions as the deadline approaches. To make it more exciting, we offered two free tickets to AWS Community Day Sri Lanka for students who delivered strong projects.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Check out my quick overview video on n8n and Bright Data on YouTube:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=r91UivY0v2o" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuioz68cgg66efc54r4x.jpg" alt="Watch on YouTube"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This journey reinforced the value of automation, scalability, and robust integrations, allowing me to focus on insights rather than infrastructure. Working with n8n was empowering, enabling rapid development of agentic solutions, while Bright Data simplified web data collection immensely. Overall, I gained deep technical knowledge, built a functional stock sentiment workflow, and successfully ran a hackathon inside a hackathon, inspiring the next generation of tech enthusiasts. I couldn’t ask for a more fulfilling experience.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>n8nbrightdatachallenge</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Amazon Cognito Observability Best Practices with Datadog</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Sun, 10 Aug 2025 12:01:52 +0000</pubDate>
      <link>https://dev.to/aws-builders/amazon-cognito-observability-best-practices-with-datadog-32p3</link>
      <guid>https://dev.to/aws-builders/amazon-cognito-observability-best-practices-with-datadog-32p3</guid>
<description>&lt;p&gt;Amazon Cognito is a user authentication and authorization service that lets you enable sign-up, sign-in, and access control for your web and mobile systems. Cognito handles user accounts, password recovery, multi-factor authentication, and more. It also allows integration with popular single sign-on (SSO) services such as Google, Facebook, and Apple. Finally, one of its most important features is the ability to scale to millions of users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There Are Two Types of Cognito Pools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. User Pools&lt;/strong&gt;&lt;br&gt;
User Pools handle user sign-up, sign-in, and authentication. They act as a user directory managed by AWS. User Pools provide features such as multi-factor authentication (MFA), password policies, and integration with identity providers like Google, Apple, SAML, and OIDC.&lt;/p&gt;

&lt;p&gt;The output of a User Pool is a set of tokens for authenticated users: an ID token (JWT), an access token, and a refresh token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Identity Pools (also known as Federated Identities)&lt;/strong&gt;&lt;br&gt;
Identity Pools provide temporary AWS credentials that allow authenticated users to access AWS resources directly. They can work with User Pools or other identity providers.&lt;/p&gt;

&lt;p&gt;Identity Pools can federate identities from multiple sources into a single AWS identity. They use AWS Security Token Service (STS) to issue temporary AWS access keys based on assigned IAM roles.&lt;/p&gt;

&lt;p&gt;In this blog post, I will walk through observability in Amazon Cognito User Pools.&lt;/p&gt;

&lt;p&gt;First things first: Cognito observability mainly relies on two types of telemetry data — metrics and logs. Let’s go through them in detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Cognito Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can enable the Amazon Cognito – Datadog integration to collect and monitor these metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F764lmhz25safdt2fo35m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F764lmhz25safdt2fo35m.png" alt="Amazon Cognito Datadog Integration" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This integration enables the following metrics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Sign-In Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Measure user authentication activity and throttling.&lt;/p&gt;

&lt;p&gt;✅ Sign-in Success % → aws.cognito.sign_in_successes&lt;br&gt;
📊 Sign-in Requests → aws.cognito.sign_in_successes.samplecount&lt;br&gt;
🏆 Successful Sign-ins → aws.cognito.sign_in_successes.sum&lt;br&gt;
🚫 Throttled Sign-ins → aws.cognito.sign_in_throttles&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Sign-Up Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track new user registrations and throttling.&lt;/p&gt;

&lt;p&gt;✅ Sign-up Success % → aws.cognito.sign_up_successes&lt;br&gt;
📊 Sign-up Requests → aws.cognito.sign_up_successes.samplecount&lt;br&gt;
🏆 Successful Sign-ups → aws.cognito.sign_up_successes.sum&lt;br&gt;
🚫 Throttled Sign-ups → aws.cognito.sign_up_throttles&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Token Refresh Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitor token refresh performance and throttling.&lt;/p&gt;

&lt;p&gt;✅ Token Refresh Success % → aws.cognito.token_refresh_successes&lt;br&gt;
📊 Token Refresh Requests → aws.cognito.token_refresh_successes.samplecount&lt;br&gt;
🏆 Successful Token Refreshes → aws.cognito.token_refresh_successes.sum&lt;br&gt;
🚫 Throttled Token Refreshes → aws.cognito.token_refresh_throttles&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Federation Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track identity federation success and throttling.&lt;/p&gt;

&lt;p&gt;✅ Federation Success % → aws.cognito.federation_successes&lt;br&gt;
📊 Federation Requests → aws.cognito.federation_successes.samplecount&lt;br&gt;
🏆 Successful Federation Requests → aws.cognito.federation_successes.sum&lt;br&gt;
🚫 Throttled Federation Requests → aws.cognito.federation_throttles&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Risk &amp;amp; Security Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Measure detected risks and blocked requests.&lt;/p&gt;

&lt;p&gt;⚠️ Account Takeover Risk → aws.cognito.account_take_over_risk, aws.cognito.account_takeover_risk&lt;br&gt;
🔐 Compromised Credential Risk → aws.cognito.compromised_credential_risk, aws.cognito.compromised_credentials_risk&lt;br&gt;
🟢 No Risk Detected → aws.cognito.no_risk&lt;br&gt;
🛑 Any Risk Detected → aws.cognito.risk&lt;br&gt;
⛔ Blocked by Config → aws.cognito.override_block&lt;/p&gt;
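&lt;p&gt;As an illustration, a Datadog metric monitor on sign-in throttles might use a query like the following (the tag, threshold, and time window are placeholders to adapt to your own user pool and traffic):&lt;/p&gt;

```
avg(last_15m):sum:aws.cognito.sign_in_throttles{userpoolid:my-user-pool} > 10
```

&lt;p&gt;Similar monitors on the sign-up, token refresh, and federation throttle metrics give early warning before users start reporting login failures.&lt;/p&gt;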

&lt;p&gt;&lt;strong&gt;Amazon Cognito Logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon Cognito offers multiple feature plans. The Plus plan is an enhanced set of user pool features designed for applications that require advanced security options, and it enables logging and analysis of user activity: you can access logs, risk ratings, and CloudWatch metrics related to user authentication activity within your user pool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw203c8fooqmk7gtfn3h8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw203c8fooqmk7gtfn3h8.png" alt="Amazon Cognito Plan Types" width="800" height="695"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You need to configure your Datadog Lambda forwarder function in AWS and add Cognito logs as a trigger to send the logs to Datadog.&lt;/p&gt;
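&lt;p&gt;As a sketch of that wiring, the subscription filter that streams the user pool's log group to the forwarder can be built and applied with boto3; the log group name and forwarder ARN below are hypothetical placeholders:&lt;/p&gt;

```python
# Hedged sketch: both names below are placeholders -- substitute your own
# Cognito Plus log group and Datadog forwarder Lambda ARN.
LOG_GROUP = "/aws/cognito/userpools/us-east-1_EXAMPLE"
FORWARDER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:datadog-forwarder"

def build_subscription_filter(log_group: str, forwarder_arn: str) -> dict:
    """Parameters for logs.put_subscription_filter that stream a
    CloudWatch Logs group to the Datadog forwarder Lambda."""
    return {
        "logGroupName": log_group,
        "filterName": "datadog-cognito-logs",
        "filterPattern": "",  # empty pattern forwards every event
        "destinationArn": forwarder_arn,
    }

params = build_subscription_filter(LOG_GROUP, FORWARDER_ARN)
# boto3.client("logs").put_subscription_filter(**params)  # uncomment to apply
```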

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq59znzlcm5spwpmep1cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq59znzlcm5spwpmep1cj.png" alt="Datadog Log Forwarder for Cognito" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will enable you to receive Cognito logs in Datadog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vddtu2n3i0hatp1tkqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9vddtu2n3i0hatp1tkqg.png" alt="Cognito Logs in Datadog" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Amazon Cognito log attributes are as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Group&lt;/th&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User &amp;amp; Identity Information&lt;/td&gt;
&lt;td&gt;🔶 userName&lt;/td&gt;
&lt;td&gt;The username involved in the event.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;userSub&lt;/td&gt;
&lt;td&gt;Unique UUID assigned to the user in the User Pool.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;idpName&lt;/td&gt;
&lt;td&gt;Identity Provider name (e.g., Google, Facebook, SAML).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;clientId&lt;/td&gt;
&lt;td&gt;App client ID used for the request.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;userPoolId&lt;/td&gt;
&lt;td&gt;Cognito User Pool identifier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;id&lt;/td&gt;
&lt;td&gt;Internal log event identifier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event Context&lt;/td&gt;
&lt;td&gt;🔶 eventType&lt;/td&gt;
&lt;td&gt;Type of event (e.g., SignUp, SignIn, PasswordChange).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;eventSource&lt;/td&gt;
&lt;td&gt;Source of the event (e.g., USER_AUTH_EVENTS).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;🔶 eventResponse&lt;/td&gt;
&lt;td&gt;Event status (e.g., Pass, Fail, InProgress).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;eventId&lt;/td&gt;
&lt;td&gt;Unique ID for the event.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;eventTimestamp / timestamp&lt;/td&gt;
&lt;td&gt;Event time in epoch milliseconds.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;creationDate&lt;/td&gt;
&lt;td&gt;Date/time the event record was created.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;challenges&lt;/td&gt;
&lt;td&gt;Authentication challenges and outcomes (e.g., Password:Success).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Risk &amp;amp; Security Signals&lt;/td&gt;
&lt;td&gt;🔶 riskDecision&lt;/td&gt;
&lt;td&gt;Risk analysis result (e.g., PASS, FAIL, BLOCK).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;compromisedCredentialDetected&lt;/td&gt;
&lt;td&gt;Whether compromised credentials were detected (true/false).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;riskLevel&lt;/td&gt;
&lt;td&gt;Level of risk detected (e.g., Low, Medium, High).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Client &amp;amp; Location Data&lt;/td&gt;
&lt;td&gt;🔶 ipAddress&lt;/td&gt;
&lt;td&gt;IP address of the client making the request.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;city&lt;/td&gt;
&lt;td&gt;City from IP geolocation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;country&lt;/td&gt;
&lt;td&gt;Country from IP geolocation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;deviceName&lt;/td&gt;
&lt;td&gt;Browser and OS details (e.g., Chrome 138, Windows 10).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging &amp;amp; Invocation Data&lt;/td&gt;
&lt;td&gt;logLevel&lt;/td&gt;
&lt;td&gt;Log severity level (e.g., INFO, WARN, ERROR).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;host&lt;/td&gt;
&lt;td&gt;Host name (e.g., cognito).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;service&lt;/td&gt;
&lt;td&gt;AWS service producing the log (e.g., cloudwatch).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;version&lt;/td&gt;
&lt;td&gt;Log event schema version.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;invoked_function_arn&lt;/td&gt;
&lt;td&gt;ARN of the Lambda function processing/forwarding the log.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;logSourceId.userPoolId&lt;/td&gt;
&lt;td&gt;User Pool ID from log source metadata.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;requestId&lt;/td&gt;
&lt;td&gt;AWS request identifier for the service call.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feedback Data (Optional)&lt;/td&gt;
&lt;td&gt;eventFeedbackDate&lt;/td&gt;
&lt;td&gt;Date feedback was recorded.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;eventFeedbackProvider&lt;/td&gt;
&lt;td&gt;Entity providing the feedback.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;eventFeedbackValue&lt;/td&gt;
&lt;td&gt;Feedback result (e.g., Valid, Invalid).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Miscellaneous&lt;/td&gt;
&lt;td&gt;hasContextData&lt;/td&gt;
&lt;td&gt;Boolean indicating additional context data availability.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Username, Event Type, Event Response, Risk Decision, and IP Address are the most commonly used attributes for building rich custom metrics that support fine-grained drill-downs.&lt;/p&gt;
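&lt;p&gt;As a minimal sketch of how such a custom metric might be fed, the snippet below parses a trimmed, hypothetical Cognito log event and pulls out those five attributes as tag values (real events carry many more fields than this sample):&lt;/p&gt;

```python
import json

# Trimmed, hypothetical Cognito Plus log event; attribute names follow
# the table above, values are made up for illustration.
SAMPLE_EVENT = json.dumps({
    "userName": "jane.doe",
    "eventType": "SignIn",
    "eventResponse": "Fail",
    "riskDecision": "BLOCK",
    "ipAddress": "203.0.113.10",
})

def metric_tags(raw_event: str) -> dict:
    """Extract the attributes most useful as tags on a Datadog custom metric."""
    event = json.loads(raw_event)
    keys = ("userName", "eventType", "eventResponse", "riskDecision", "ipAddress")
    return {k: event.get(k, "unknown") for k in keys}
```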

&lt;p&gt;Finally, create a Service Level Indicator (SLI) dashboard that presents these signals from a business perspective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhvln5hntrpjswa8fafo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhvln5hntrpjswa8fafo.png" alt="SLI Dashboard" width="800" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Troubleshooting Cognito-Related Issues: Best Practices&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Use Amazon Cognito Plus Tier&lt;br&gt;
The Cognito Plus tier is highly recommended.&lt;br&gt;
It enables log delivery and provides risk-based metrics such as riskDecision, eventType, and eventResponse, which are essential for troubleshooting authentication and security issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build Custom Metrics Using Logs in Datadog&lt;br&gt;
Cognito logs come with rich attributes (refer to the previous table), allowing you to create powerful custom metrics.&lt;br&gt;
These metrics can offer deep visibility into user behavior, login patterns, error spikes, and other critical insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set Up SLI and SLO Dashboards&lt;br&gt;
It's important to translate technical metrics into your business context — in other words, what your end users are actually experiencing.&lt;br&gt;
This allows you to build meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) that track reliability from a user-focused perspective.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
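&lt;p&gt;A tiny sketch of step 3: computing a sign-in success-rate SLI and the remaining error budget against an SLO target. The counts and the 99.5% target below are illustrative assumptions:&lt;/p&gt;

```python
def auth_sli(successes: int, attempts: int) -> float:
    """Sign-in success rate (%) -- a simple, user-focused SLI."""
    return 100.0 * successes / attempts if attempts else 100.0

def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget left (1.0 = untouched, 0.0 = exhausted)."""
    budget = 100.0 - slo   # allowed failure percentage
    burned = 100.0 - sli   # observed failure percentage
    return max(0.0, 1.0 - burned / budget) if budget else 0.0

# Illustrative numbers: 9,980 successful sign-ins out of 10,000 vs a 99.5% SLO
sli = auth_sli(9_980, 10_000)
remaining = error_budget_remaining(sli, slo=99.5)
```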

&lt;p&gt;That's a wrap on my AWS Cognito Observability Guide. Use these best practices to improve visibility, reduce troubleshooting time, and align system metrics with business goals.&lt;/p&gt;

</description>
      <category>cognito</category>
      <category>awsobservability</category>
      <category>sre</category>
      <category>datadog</category>
    </item>
    <item>
      <title>Amazon API Gateway Observability Best Practices with Datadog</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Sun, 03 Aug 2025 04:37:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/amazon-api-gateway-observability-best-practices-with-datadog-1eod</link>
      <guid>https://dev.to/aws-builders/amazon-api-gateway-observability-best-practices-with-datadog-1eod</guid>
      <description>&lt;p&gt;AWS API Gateway is a fully managed service from AWS that allows you to create, publish, and maintain APIs at any scale. It acts as a gateway to your application's backend services, including AWS Lambda, EKS, ECS, EC2, and more.&lt;/p&gt;

&lt;p&gt;You can explore the full documentation here:&lt;br&gt;
🔗 &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/welcome.html" rel="noopener noreferrer"&gt;API Gateway Developer Guide&lt;/a&gt; – everything you need to know about API Gateway&lt;/p&gt;

&lt;p&gt;To make sure we’re aligned on the fundamentals, I’ve created an API Gateway Essentials summary below. It gives you a quick overview of the core capabilities this service offers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbet8c6gx5ujnqje3zka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbet8c6gx5ujnqje3zka.png" alt="API Gateway Essentials" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main objective of this blog is to walk through how to monitor and observe AWS API Gateway using Datadog — one of the leading observability platforms that provides full-stack visibility into AWS environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before diving in, a quick refresher:&lt;/strong&gt;&lt;br&gt;
Observability is the practice of using telemetry data (logs, metrics, and traces) to understand a system’s internal state. In this case, we’ll leverage API Gateway’s logs, metrics, and traces to gain insights into what’s really happening under the hood.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Gateway Logs&lt;/strong&gt;&lt;br&gt;
AWS provides built-in support for enabling logs. You can enable them under API Gateway → Stages, where logging options are available for both access logs and execution logs.&lt;/p&gt;
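&lt;p&gt;As a hedged sketch, those stage settings can also be applied programmatically via boto3's &lt;code&gt;update_stage&lt;/code&gt; patch operations; the log-group ARN and the access-log format string below are illustrative choices, not the only valid ones:&lt;/p&gt;

```python
# Access-log format using standard $context variables (illustrative subset).
ACCESS_LOG_FORMAT = ('{"requestId":"$context.requestId","status":"$context.status",'
                     '"latency":"$context.responseLatency"}')

def access_log_patch_ops(destination_arn: str, fmt: str) -> list:
    """patchOperations for apigateway.update_stage that turn on access
    logs and INFO-level execution logs for a stage."""
    return [
        {"op": "replace", "path": "/accessLogSettings/destinationArn", "value": destination_arn},
        {"op": "replace", "path": "/accessLogSettings/format", "value": fmt},
        {"op": "replace", "path": "/*/*/logging/loglevel", "value": "INFO"},
    ]

ops = access_log_patch_ops(
    "arn:aws:logs:us-east-1:123456789012:log-group:apigw-access-logs",
    ACCESS_LOG_FORMAT)
# boto3.client("apigateway").update_stage(
#     restApiId="abc123", stageName="prod", patchOperations=ops)
```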

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfrqswwpl5n3u6xdfhd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfrqswwpl5n3u6xdfhd9.png" alt="API Gateway Logs Configuration" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once logging is enabled, you can configure API Gateway to send logs to Datadog.&lt;/p&gt;

&lt;p&gt;Configuration guide: &lt;a href="https://docs.datadoghq.com/integrations/amazon-api-gateway/" rel="noopener noreferrer"&gt;Datadog + API Gateway Integration&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweaxceuqj431nf1vifmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweaxceuqj431nf1vifmu.png" alt="API Gateway Logs in Datadog" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Logs Matter&lt;/strong&gt;&lt;br&gt;
Logs are essential for troubleshooting issues in API Gateway. In most cases, failures fall into one of two categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend-related issues&lt;/strong&gt;&lt;br&gt;
Unresponsive services (e.g., Lambda, EC2, EKS) or misconfigurations such as timeouts or incorrect integration responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS infrastructure-level issues (rare)&lt;/strong&gt;&lt;br&gt;
These could include internal AWS errors or regional service disruptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Causes of API Gateway Failures&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Misconfigured integrations (e.g., VPC links, request/response mapping templates)&lt;/li&gt;
&lt;li&gt;Backend timeouts&lt;/li&gt;
&lt;li&gt;Incorrect or missing HTTP status code mappings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;API Gateway Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS provides a rich set of metrics for API Gateway that align with the golden signals of observability: traffic, errors, and latency. These metrics are essential for monitoring the health, performance, and reliability of your APIs — helping you detect issues early and respond proactively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Gateway Metrics – Grouped Summary&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Type&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Metric&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Traffic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws.apigateway.count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total number of API requests received&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;aws.apigateway.count.p50&lt;/code&gt; - &lt;code&gt;.p99&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Percentile distribution of request count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.hits&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total hits from traces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.hits.by_http_status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hits grouped by HTTP status code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.stage.hits&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hits per deployment stage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.stage.hits.by_http_status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stage-level hits by HTTP status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws.apigateway.4xxerror&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Client-side errors (e.g., invalid request, unauthorized)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;aws.apigateway.4xxerror.p50&lt;/code&gt; - &lt;code&gt;.p99&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Percentiles of 4xx error rates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws.apigateway.5xxerror&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Server-side/API errors (e.g., backend failure)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;aws.apigateway.5xxerror.p50&lt;/code&gt; - &lt;code&gt;.p99&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Percentiles of 5xx error rates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws.apigateway.latency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total time from request to response (includes backend)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;aws.apigateway.latency.p50&lt;/code&gt; - &lt;code&gt;.p99&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Percentile breakdown of total latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;aws.apigateway.latency.minimum&lt;/code&gt; / &lt;code&gt;.maximum&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Min and max observed latency values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Integration Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aws.apigateway.integration_latency&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Time spent in the backend integration only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;aws.apigateway.integration_latency.p50&lt;/code&gt; - &lt;code&gt;.p99&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Percentile breakdown of backend latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;aws.apigateway.integration_latency.minimum&lt;/code&gt; / &lt;code&gt;.maximum&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Min and max integration latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tracing / Duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Trace-based total API duration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.duration.by_http_status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Duration per status code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.stage.duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Duration per stage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.stage.duration.by_http_status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stage duration by status code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tracing / Apdex&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.stage.apdex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;User satisfaction score (Apdex) per stage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Meta&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Base trace for API Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;trace.aws.apigateway.stage&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Trace identifier for specific stage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
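&lt;p&gt;As a small worked example, the error signal can be derived directly from the count metrics above; the request and error counts here are made up for illustration:&lt;/p&gt;

```python
def error_rates(total: int, err_4xx: int, err_5xx: int) -> dict:
    """Client/server error rates (%) derived from request and error counts."""
    def pct(n: int) -> float:
        return round(100.0 * n / total, 2) if total else 0.0
    return {"4xx": pct(err_4xx), "5xx": pct(err_5xx), "total": pct(err_4xx + err_5xx)}

# Illustrative sample: 2,000 requests with 40 client and 10 server errors
rates = error_rates(total=2_000, err_4xx=40, err_5xx=10)
```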

&lt;p&gt;&lt;strong&gt;API Gateway Tracing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A best practice is to enable tracing for Application Performance Monitoring (APM) on your backend services—such as AWS Lambda or microservices running on ECS, EKS, or EC2. Enabling tracing automatically provides you with the API Gateway tracer view, giving detailed insights into the flow and performance of your APIs.&lt;/p&gt;

&lt;p&gt;In the example below, I have enabled tracing for an AWS Lambda backend, which allows me to view the API Gateway trace data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuuijykr6g3hext7w5mvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuuijykr6g3hext7w5mvt.png" alt="API Gateway Trace View" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example below shows a trace starting from API Gateway, capturing the end-to-end flow through the backend Lambda function and any other integrated services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1ny9lmkin5nnr3f0x2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1ny9lmkin5nnr3f0x2z.png" alt="A Trace starting from API Gateway" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Level Indicator (SLI) Dashboard for API Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, you need to bring everything together and create a single source of truth dashboard for API Gateway, which provides insights into traffic, errors, and latency. It should include request volume and trends to help identify potential issues promptly.&lt;/p&gt;

&lt;p&gt;The dashboard should also highlight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Failed traces&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slow traces&lt;/strong&gt; – traces taking more than x seconds, useful for identifying slow requests passing through API Gateway that require further investigation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Relevant logs&lt;/strong&gt; for deeper analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A combination of all these elements will give you a comprehensive view of your API Gateway, enabling effective monitoring and faster troubleshooting of any potential failures or performance issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fit1uciogfexwjhx0geqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fit1uciogfexwjhx0geqy.png" alt="API Gateway Dashboard" width="689" height="834"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that wraps up a complete guide to achieving observability for Amazon API Gateway using Datadog.&lt;/p&gt;

</description>
      <category>amazonapigateway</category>
      <category>aws</category>
      <category>observability</category>
      <category>sre</category>
    </item>
    <item>
      <title>CloudFront Observability Best Practices with Datadog</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Wed, 02 Jul 2025 08:51:56 +0000</pubDate>
      <link>https://dev.to/aws-builders/cloudfront-observability-best-practices-with-datadog-2a6p</link>
      <guid>https://dev.to/aws-builders/cloudfront-observability-best-practices-with-datadog-2a6p</guid>
      <description>&lt;p&gt;&lt;strong&gt;Amazon CloudFront&lt;/strong&gt; is Amazon's own Content Delivery Network (CDN), designed to speed up content delivery to users by distributing it across a global network of edge locations. CloudFront caches content closer to users, thereby reducing latency.&lt;/p&gt;

&lt;p&gt;You can explore the full CloudFront documentation &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To make sure we’re aligned on the fundamentals, I’ve created a CloudFront Essentials summary below. It gives you a quick overview of the core capabilities this service offers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobd5npuu8h0o2tfsnyc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobd5npuu8h0o2tfsnyc0.png" alt="CloudFront Essentials" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When using Amazon CloudFront, it’s essential to enable complete visibility into what’s happening at that layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leverage CloudFront Metrics for Performance and Latency Observability&lt;/strong&gt;&lt;br&gt;
Start with the default CloudFront metrics, which give valuable insights:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requests&lt;/strong&gt; – Tracks the number of HTTP/HTTPS requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total Error Rate&lt;/strong&gt; – Monitors the overall error rate, including both 4xx and 5xx errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4xx and 5xx Error Rate&lt;/strong&gt; – Separates client and server errors for more granular analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bytes Downloaded/Uploaded&lt;/strong&gt; – Helps track data volume and monitor trends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To get deeper visibility, enable additional CloudFront metrics&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache Hit Rate&lt;/strong&gt; – Shows the percentage of requests served from the cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Origin Latency&lt;/strong&gt; – Measures how long CloudFront takes to start responding when content comes from the origin (not the cache).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error Rate by Status Code&lt;/strong&gt; – Breaks down errors further (e.g., 401, 403, 502) for precise troubleshooting.&lt;/p&gt;

&lt;p&gt;These metrics give you a clear view of what’s happening inside your CloudFront distribution.&lt;/p&gt;
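&lt;p&gt;As a small worked example, the cache hit rate is simply the share of requests served from the edge cache, and its complement tells you how many requests still reach the origin; the counts below are illustrative:&lt;/p&gt;

```python
def cache_hit_rate(cache_hits: int, total_requests: int) -> float:
    """Percentage of requests served from CloudFront's edge cache."""
    return round(100.0 * cache_hits / total_requests, 2) if total_requests else 0.0

def origin_requests(total_requests: int, cache_hits: int) -> int:
    """Requests that missed the cache and hit the origin (tuning candidates)."""
    return total_requests - cache_hits
```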

&lt;p&gt;To enable CloudFront metrics, first complete the Datadog AWS integration via the Datadog Integrations page, and then enable CloudFront metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3euswy7kln6d10fvdex7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3euswy7kln6d10fvdex7.png" alt="Datadog CloudFront Integration" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will be able to see the CloudFront metrics via the Datadog Metrics Explorer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyuyqwv7wc17dmabqeytf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyuyqwv7wc17dmabqeytf.png" alt="Datadog CloudFront Metrics" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use CloudFront Logs to Accelerate Troubleshooting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To ship CloudFront logs to Datadog, configure the Datadog Forwarder Lambda function, add a trigger, and set up CloudFront as a log source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8gfpvx1zmruj617px3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8gfpvx1zmruj617px3y.png" alt="Datadog AWS Log Forwarder Lambda" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.datadoghq.com/logs/guide/forwarder/?tab=cloudformation" rel="noopener noreferrer"&gt;Datadog AWS Log Forwarder Configuration&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enable CloudFront access logs&lt;/strong&gt; (delivered to Amazon S3) to analyze user behavior and troubleshoot issues. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodfe3oym5y5cwwripd4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodfe3oym5y5cwwripd4g.png" alt="CloudFront Logs" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logs help you observe:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache Optimization&lt;/strong&gt; – Improve cache hit/miss rates to maximize CDN benefits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic Patterns&lt;/strong&gt; – Understand who is accessing your content, from where and when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Issues&lt;/strong&gt; – Identify regions or requests experiencing high latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error Analysis&lt;/strong&gt; – Discover why certain requests fail or aren't cached.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt; – Detect suspicious activity or unauthorized access attempts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enable Tracing for Code-Level Visibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enable tracing tools such as AWS X-Ray or Datadog APM to trace requests across services. This allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pinpoint performance bottlenecks&lt;/li&gt;
&lt;li&gt;See what’s happening inside your code during a request&lt;/li&gt;
&lt;li&gt;Correlate CloudFront performance with backend services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tracing adds depth to your observability stack and helps you find issues faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bringing It All Together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combining CloudFront Metrics, Logs, and Traces gives you complete observability of your CDN layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5pqcxp3q8vt9km62skq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5pqcxp3q8vt9km62skq.png" alt="CloudFront Observability" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s Next: Turn Observability into Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you have visibility, use it to continuously improve:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Optimize Cache Hit Rate&lt;/strong&gt;&lt;br&gt;
Analyze cache behavior (hit/miss ratio). The goal of CloudFront is to serve the majority of requests from the cache, which improves speed and reduces origin load. Monitor trends and assess how new deployments affect caching. Constant observation leads to measurable improvements.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Fine-Tune Cache Configuration&lt;/strong&gt;&lt;br&gt;
Review and adjust cache TTLs, headers, cookies, and query string settings. Use cache policies and origin request policies for better control and efficiency.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Identify and Resolve Latency Hotspots&lt;/strong&gt;&lt;br&gt;
Use Origin Latency metrics to detect slow origins or network bottlenecks. Continuously monitor and improve based on findings.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Set Up Alerts&lt;/strong&gt;&lt;br&gt;
Configure alerts for high error rates (4xx/5xx), increasing latency, or dropping cache hit ratios. Early alerts help resolve issues before they impact users.&lt;/p&gt;
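&lt;p&gt;As one hedged example, a Datadog metric monitor for a sustained 5xx spike might be defined like this. The metric name and tag key are assumptions based on the AWS integration's naming convention, so verify them in your own Metrics Explorer before relying on this:&lt;/p&gt;

```python
def cloudfront_5xx_monitor(distribution_id: str, threshold: float = 5.0) -> dict:
    """Payload for Datadog's create-monitor API alerting on a sustained
    5xx error-rate spike for a single distribution (names assumed)."""
    query = (f"avg(last_10m):avg:aws.cloudfront.5xx_error_rate"
             f"{{distributionid:{distribution_id}}} > {threshold}")
    return {
        "type": "metric alert",
        "name": f"CloudFront 5xx error rate high ({distribution_id})",
        "query": query,
        "message": "5xx error rate above threshold; check origin health. @oncall",
        "options": {"thresholds": {"critical": threshold}},
    }

monitor = cloudfront_5xx_monitor("E2EXAMPLE123")
```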

&lt;p&gt;✅ &lt;strong&gt;Use Geo and Device Insights&lt;/strong&gt;&lt;br&gt;
Analyze where your traffic comes from and what devices are used. This helps optimize delivery strategies and detect anomalies or unauthorized access.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Correlate Data Across Services&lt;/strong&gt;&lt;br&gt;
Link CloudFront data with backend services for end-to-end observability. Enabling tracing across services provides a full picture of request flows and system health.&lt;/p&gt;

&lt;p&gt;With the right combination of metrics, logs, and traces, you can unlock powerful insights into your CloudFront performance, troubleshoot faster, and continuously improve user experience.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Top AWS CloudWatch Anti-Patterns that could derail your Observability strategy</title>
      <dc:creator>Indika_Wimalasuriya</dc:creator>
      <pubDate>Tue, 25 Feb 2025 03:25:14 +0000</pubDate>
      <link>https://dev.to/aws-builders/top-aws-cloudwatch-anti-patterns-that-could-derail-your-observability-strategy-4ed</link>
      <guid>https://dev.to/aws-builders/top-aws-cloudwatch-anti-patterns-that-could-derail-your-observability-strategy-4ed</guid>
      <description>&lt;p&gt;AWS CloudWatch provides a comprehensive service stack to enable end-to-end full-stack observability for your applications, regardless of whether they are deployed as server-based (EC2), container-based (ECS), or serverless (Lambda, EKS/ECS with Fargate). When you start a new project, you can follow a standard approach to enable full-stack observability via CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, you need to enable telemetry for your application. CloudWatch can’t do anything unless your application emits logs, metrics, and traces. At a high level, you can leverage the CloudWatch Agent or AWS Distro for OpenTelemetry for instrumentation to collect logs, metrics, and traces. You can also enable Real User Monitoring (RUM) to capture digital experience-related metrics. The good thing about CloudWatch is that, whether server-based or serverless, all your infrastructure-related direct/indirect metrics are automatically available to you. It’s also a best practice to leverage the insights and analytics provided, such as Container Insights, Lambda Insights, and Application Insights. These are great features for obtaining telemetry data.&lt;/p&gt;
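&lt;p&gt;To make the first step concrete, here is a minimal CloudWatch agent configuration sketch that collects a couple of host metrics and ships one application log file. The file path and log group name are hypothetical; the agent consumes this as JSON (typically &lt;code&gt;amazon-cloudwatch-agent.json&lt;/code&gt;):&lt;/p&gt;

```python
import json

# Minimal CloudWatch agent config sketch (paths and group names are
# illustrative, not from any real system): memory/disk metrics plus one log.
agent_config = {
    "metrics": {
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/app/app.log",  # hypothetical path
                        "log_group_name": "/myapp/application",
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

# The agent reads this as a JSON file on the instance.
print(json.dumps(agent_config, indent=2))
```

&lt;p&gt;Starting from a small config like this and growing it deliberately beats copying a sprawling template you don't understand.&lt;/p&gt;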

&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, it’s about enabling your basic observability, which includes alerting, dashboards, and automation. These days, AIOps use cases like anomaly detection can be leveraged as well. You can create synthetic monitors, define Service Level Objectives (SLOs), and create intelligent alerts.  &lt;/p&gt;

&lt;p&gt;In general, AWS provides the building blocks for you to develop a comprehensive full-stack observability solution.&lt;/p&gt;

&lt;p&gt;Let’s now look at some of the anti-patterns related to CloudWatch that you need to avoid:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Not having an observability plan and relying on pockets of good practices – a “feel-good” approach.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Not keeping track of CloudWatch updates – you’re going to miss a lot of important information.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Not embracing automatic instrumentation when designing and developing your systems.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Not moving away from static thresholds.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuring alerts without considering the customer’s needs.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Underestimating the role of GenAI in operations, especially in AWS.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's dive into the details of each of these anti-patterns:&lt;/p&gt;

&lt;p&gt;1). &lt;strong&gt;Focusing on pockets of observability instead of full-stack observability. Yes, full-stack observability is what you should strive for.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend performance monitoring: use Real User Monitoring (RUM) to capture real-user experience data.&lt;/li&gt;
&lt;li&gt;Configure Synthetic Canaries for additional proactive monitoring.&lt;/li&gt;
&lt;li&gt;Enable Application Performance Monitoring (APM) to create services and service maps. This also provides traces and trace maps.&lt;/li&gt;
&lt;li&gt;Send logs to CloudWatch for centralized monitoring.&lt;/li&gt;
&lt;li&gt;Metrics: with the above in place, you should have all the metrics needed for comprehensive observability of your application. Follow the four golden signals when designing your metrics: traffic, error rate, latency, and saturation (capacity).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2). &lt;strong&gt;Not keeping track of what AWS is doing with CloudWatch. It’s the easiest way to miss out on some great capabilities.&lt;/strong&gt;  Yes, you may wonder why I'm mentioning this, but it's based on my experience. AWS frequently releases new updates, and I often tend to miss them.&lt;br&gt;
In case you missed it, recent CloudWatch changes include the ability to maintain context in observability data, near real-time network monitoring, database insights for Amazon Aurora, enhanced observability for ECS, enhanced application signals providing transaction spans, and centralized telemetry configuration and visibility.&lt;/p&gt;

&lt;p&gt;3). &lt;strong&gt;Enabling telemetry can be challenging, but it doesn’t have to be difficult. Failing to leverage the automatic telemetry instrumentation provided by CloudWatch Application Signals is a major mistake.&lt;/strong&gt;&lt;br&gt;
CloudWatch Application Signals is a great feature that allows you to automatically instrument your application. This removes the burden of manually enabling metrics, logs, and traces. Application Signals supports platforms such as EKS, EC2, Kubernetes, Lambda, ECS, and some custom hosting options.&lt;/p&gt;

&lt;p&gt;All the services that are automatically discovered will be enabled with four golden signals-aligned metrics, service pages, and service maps. This is a great feature to fast-track your observability implementation.&lt;/p&gt;

&lt;p&gt;4). &lt;strong&gt;When it comes to alerting, focusing on static thresholds is probably the biggest anti-pattern of all. Not embracing built-in AIOps capabilities, like metric and log anomaly detection, is a mistake you can’t undo.&lt;/strong&gt;&lt;br&gt;
Static thresholds are a thing of the past. We are now in the era of AI, and anomaly detection plays a major role. It’s about balancing your performance and getting alerted when the baseline is breached (either upwards or downwards). CloudWatch provides metric anomaly detection, which is a great capability you must use. CloudWatch log anomaly detection is another excellent feature to stay on top of your logs. Let CloudWatch alert you whenever there is a new error appearing or an increase in existing error conditions. You can enable this for all your log groups.&lt;/p&gt;
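&lt;p&gt;An anomaly-detection alarm replaces the fixed number with a band learned from history. A sketch of the &lt;code&gt;put_metric_alarm&lt;/code&gt; parameters for this, using ALB latency as an example metric; the alarm name, load balancer value, and band width are illustrative:&lt;/p&gt;

```python
# Sketch: anomaly-detection alarm instead of a static threshold
# (names and dimension values are illustrative).
# In practice: boto3.client("cloudwatch").put_metric_alarm(**alarm)

alarm = {
    "AlarmName": "latency-anomaly",  # hypothetical name
    "ComparisonOperator": "GreaterThanUpperThreshold",
    "EvaluationPeriods": 2,
    "ThresholdMetricId": "band",     # alarm against the band, not a fixed number
    "Metrics": [
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "TargetResponseTime",
                    "Dimensions": [
                        {"Name": "LoadBalancer", "Value": "app/my-alb/123"},  # placeholder
                    ],
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
        {
            "Id": "band",
            # Expected range learned from history; 2 = band width in std deviations.
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
        },
    ],
}
```

&lt;p&gt;The band adapts to daily and weekly patterns, so the same alarm stays meaningful as traffic grows, where a static number would need constant retuning.&lt;/p&gt;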

&lt;p&gt;5). &lt;strong&gt;Developing a bunch of random alerts covering all corners of the application but still struggling to identify customer experience-related failures.&lt;/strong&gt;&lt;br&gt;
We are very good at creating a lot of alerts, but some of them don’t make sense. For example, if a workload goes down, autoscaling will bring it back up. Yes, you need to find the root cause, but it might have minimal impact on the customer experience. Similarly, high resource utilization is often not a problem, since the system runs just fine. Don’t get me wrong, I’m not saying we shouldn’t monitor these things, but time is precious, and we need to focus on what really matters. Instead of being flooded with non-actionable alerts, we need to focus on issues that directly impact users.&lt;/p&gt;

&lt;p&gt;This is where Service Level Objectives (SLOs) come in. SLOs define what "good" means for your systems and are closely correlated with end-user experience. CloudWatch provides the ability to create and track SLOs. You should focus more on developing SLOs and then build an alert framework around them.&lt;/p&gt;
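&lt;p&gt;The arithmetic behind an SLO is worth internalizing: the target implies an error budget, and alerting on budget burn is what keeps alarms tied to user impact. A tiny sketch with made-up numbers:&lt;/p&gt;

```python
# Sketch: translating an SLO target into an error budget and checking burn.
# The request counts below are illustrative.

def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests allowed to fail over the window under the SLO."""
    return int(total_requests * (1 - slo_target))

# A 99.9% availability SLO over 1,000,000 requests allows 1,000 failures.
budget = error_budget(0.999, 1_000_000)
failed = 250
print(f"budget: {budget}, consumed: {failed / budget:.0%}")
```

&lt;p&gt;An alert on "25% of the error budget consumed in an hour" is actionable in a way that "CPU above 80%" rarely is.&lt;/p&gt;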

&lt;p&gt;6). &lt;strong&gt;Skipping GenAI in cloud operations is a mistake. GenAI is here to stay, and AWS has already integrated it to provide you with AI Operations capabilities.&lt;/strong&gt;&lt;br&gt;
CloudWatch is now integrated with Amazon Q Developer. With the new GenAI integration, you'll be able to tap into Q to provide intelligent insights when troubleshooting issues using the telemetry data already residing within CloudWatch. It may take a little time to get used to, but once you get past the initial phase, it will definitely help expedite root cause identification. The world is moving towards AIOps, and this is a great feature to experiment with to reduce SME dependencies as well.&lt;/p&gt;

&lt;p&gt;That’s it! Those are the top 6 CloudWatch anti-patterns I think you should avoid. AWS CloudWatch is one of the best observability suites available in the market. While AWS provides great services, we need to use them correctly to get the best results.&lt;/p&gt;

</description>
      <category>awscloudoperations</category>
      <category>cloudwatch</category>
      <category>awsobservability</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
