DEV Community: Mohammed Anes

AWS DevOps Agent + CDK Python: Building a Real Agentic SRE From Scratch (No Splunk, No 3 Accounts)

Mohammed Anes — Sun, 24 May 2026 07:29:05 +0000

"At 2 AM, a CloudWatch alarm fired on my Lambda function. I did not get paged. The agent woke up instead, investigated the function, correlated logs and metrics, posted the root cause to Slack, and went back to sleep. I only found out the next morning when I checked Slack."

That is the system I built this week. This post walks through every step — architecture, CDK code, webhook signing, Skill upload, and the honest parts that went wrong.

MohammedAnes / agentic-sre

Autonomous incident response with AWS DevOps Agent + CDK Python

Agentic SRE with AWS DevOps Agent

An end-to-end autonomous incident response system built on AWS DevOps Agent. When a CloudWatch alarm fires, the agent automatically investigates the root cause, correlates telemetry, generates a mitigation plan, and posts findings to Slack, all without waking up an engineer.

Built as a companion to: "I Built an Agentic SRE on AWS DevOps Agent — Here's What the Official Guide Left Out"

Architecture

CloudWatch Alarm (error rate / duration / throttling)
        │
        ▼
    SNS Topic
        │
        ▼
  Webhook Handler Lambda
  (HMAC-signed payload)
        │
        ▼
  AWS DevOps Agent Space
  (CloudWatch + GitHub + Slack)
        │
        ├── Investigates using Lambda Runbook Skill
        ├── Correlates metrics + logs + deployments
        ├── Generates mitigation plan
        └── Posts root cause to Slack
                │
                ▼
        Agent-Ready Spec → Kiro (code fix)

Single-account setup. Unlike the AWS reference architecture (3 accounts + Splunk) this repo is designed to run in one AWS…


Why I Built This

Every engineer who has been on-call knows the pattern. Alarm fires. You open the laptop half-asleep. CloudWatch. Logs. Recent deployments. Metrics. Cross-reference timestamps. Forty-five minutes later: someone deployed bad code two hours ago. The fix is a one-line rollback.

That entire investigation loop is repeatable, mechanical, and does not require a human. AWS DevOps Agent can do it autonomously. The moment an alarm fires, the agent starts investigating — correlating telemetry, reading your runbooks, checking deployment history, and posting findings to Slack. You get involved only when there is a real decision to make.

I wanted to build a working, single-account version of this without requiring Splunk or an enterprise AWS setup. Here is what I built and what I learned.

Before You Start: What Already Exists (And Why This Is Different)

I did my research before writing a single line of code.

The AWS official blog covers an end-to-end agentic SRE using DevOps Agent. It is thorough. It also requires a 3-account setup with Splunk, which most developers do not have. There is also an official aws-samples/sample-aws-devops-agent-cdk repo — but it is TypeScript CDK only, with no webhook pipeline, no custom Skill, and no demo trigger scripts.

Here is the exact gap this post fills:

Topic	AWS Official Blog	aws-samples CDK repo	This Build
Language	Node.js snippets	TypeScript CDK	Python CDK
Accounts needed	3 accounts	1-2 accounts	1 account
Splunk required	Yes	No	No
Full webhook handler	Partial snippet	Not included	Complete Python Lambda
Custom Skill / Skill zip format	Mentioned	Not included	Full runbook + frontmatter gotcha
Demo trigger script	Not included	Basic invoke	One-liner with 4 modes
Real investigation screenshots	No	No	Yes — including agent gaps

If you are a Python CDK developer and you searched for a complete walkthrough — this is it.

Architecture Overview

┌─────────────────────────────────────────────────┐
│              AWS Account (Single)                │
│                                                  │
│  ┌──────────────┐    metrics + logs              │
│  │ Demo Lambda  │──────────────────────────────┐ │
│  │ (broken app) │                              ▼ │
│  └──────────────┘                    ┌──────────────────┐
│                                      │   CloudWatch      │
│                                      │   Alarms (3)      │
│                                      └────────┬─────────┘
│                                               │ state change
│                                               ▼
│                                      ┌──────────────────┐
│                                      │    SNS Topic      │
│                                      └────────┬─────────┘
│                                               │ triggers
│                                               ▼
│                                      ┌──────────────────┐
│                                      │ Webhook Handler  │
│                                      │ Lambda           │
│                                      │ (HMAC signed)    │
│                                      │ Secret from      │
│                                      │ Secrets Manager  │
│                                      └────────┬─────────┘
│                                               │ HTTP POST
│                                               ▼
│                            ┌─────────────────────────────┐
│                            │   AWS DevOps Agent Space    │
│                            │   (agentic-sre-space)       │
│                            │                             │
│                            │  Reads CloudWatch           │
│                            │  Reads GitHub deploys       │
│                            │  Follows SKILL.md           │
│                            │                             │
│                            │  → Root cause analysis      │
│                            │  → Mitigation plan          │
│                            │  → Slack notification       │
│                            │  → Agent-ready spec → Kiro  │
│                            └─────────────────────────────┘
└─────────────────────────────────────────────────────────┘

Three CDK stacks deploy in sequence. Stack outputs are passed as cross-stack references — no manual copy-pasting of ARNs.

Prerequisites

Before deploying:

AWS CLI configured (aws configure)
Python 3.12+
Node.js 18+ (required for CDK CLI)
AWS CDK v2 installed: npm install -g aws-cdk
AWS DevOps Agent available in your region — currently supported in us-east-1, us-west-2, ap-southeast-2, ap-northeast-1, eu-central-1, eu-west-1 only
A GitHub account (for deployment correlation)
A Slack workspace where you can add apps

Part 1 — Project Structure

agentic-sre/
├── app.py                          # CDK entry point — wires all 3 stacks
├── cdk.json
├── requirements.txt
├── infra/
│   └── stacks/
│       ├── demo_app_stack.py       # Stack 1: the Lambda being monitored
│       ├── alarm_stack.py          # Stack 2: CloudWatch alarms + SNS topic
│       └── webhook_stack.py        # Stack 3: webhook handler + Secrets Manager
├── demo_app/
│   └── index.py                    # Lambda with 4 controllable failure modes
├── webhook_handler/
│   └── index.py                    # SNS → DevOps Agent forwarder (HMAC signed)
├── skills/
│   └── lambda-investigation-runbook.md
└── scripts/
    └── trigger.py                  # Demo: force alarms or burst invocations

The CDK entry point wires all three stacks with cross-stack references:

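# app.py
import aws_cdk as cdk
from infra.stacks.alarm_stack import AlarmStack
from infra.stacks.demo_app_stack import DemoAppStack
from infra.stacks.webhook_stack import WebhookStack

# env is a cdk.Environment (account + region) defined elsewhere in the full file
app = cdk.App()
demo = DemoAppStack(app, "AgenticSreDemoApp", env=env)
alarm = AlarmStack(app, "AgenticSreAlarms", lambda_fn=demo.lambda_fn, env=env)
webhook = WebhookStack(app, "AgenticSreWebhook", alarm_topic=alarm.alarm_topic, env=env)

app.synth()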

AlarmStack receives the Lambda function object from DemoAppStack so it can create metric alarms on that exact function. WebhookStack receives the SNS topic object from AlarmStack so it can subscribe the webhook Lambda. No magic strings anywhere.

Part 2 — Stack 1: The Demo App

This Lambda simulates a production microservice with four controllable failure modes:

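# demo_app/index.py
import json
import logging
import random
import time

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):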
    simulate = event.get("simulate", "normal")

    # Failure Mode 1: DB connection error — triggers error rate alarm
    if simulate == "error":
        logger.error("Simulated DB connection timeout")
        raise Exception(
            "DB connection timeout: could not reach postgres://demo-db:5432/app "
            "after 3 retries. Connection pool exhausted."
        )

    # Failure Mode 2: Slow response — triggers duration alarm (P99 near 30s timeout)
    if simulate == "slow":
        delay = random.uniform(25, 28)
        time.sleep(delay)
        return {"statusCode": 200, "body": json.dumps({"latency_simulated": True})}

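    # Failure Mode 3: Hard timeout. Sleep past the remaining execution time so
    # Lambda kills the invocation ("Task timed out after ..."); a fixed 29 s
    # sleep would still finish inside a 30 s timeout.
    if simulate == "timeout":
        time.sleep(context.get_remaining_time_in_millis() / 1000 + 1)
        return {}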

    # Normal operation
    latency = random.uniform(10, 120)
    return {"statusCode": 200, "body": json.dumps({"status": "ok", "latency_ms": round(latency)})}

💡 Note: Why detailed error messages matter: The error message includes a fake DB connection string. When the agent reads CloudWatch logs, it sees this and correctly identifies a database connectivity issue — even though no real database exists. Your log quality directly affects investigation quality.

Part 3 — Stack 2: CloudWatch Alarms

Three alarms covering the most common Lambda failure categories:

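# infra/stacks/alarm_stack.py
from aws_cdk import Duration
from aws_cdk import aws_cloudwatch as cw
from aws_cdk import aws_cloudwatch_actions as cw_actions
from aws_cdk import aws_sns as sns

# The rest runs inside AlarmStack.__init__; lambda_fn is the function passed in from DemoAppStack.
self.alarm_topic = sns.Topic(self, "AlarmTopic", topic_name="agentic-sre-alarm-topic")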

# Alarm 1: Error rate > 20% over 5 minutes
# MathExpression gives you a RATE, not a raw count.
# 5 errors from 5 invocations (100%) should alarm.
# 5 errors from 10,000 invocations (0.05%) should NOT.
error_rate_alarm = cw.Alarm(
    self, "ErrorRateAlarm",
    alarm_name="agentic-sre-high-error-rate",
    metric=cw.MathExpression(
        expression="errors / invocations * 100",
        using_metrics={
            "errors": lambda_fn.metric_errors(period=Duration.minutes(5), statistic="Sum"),
            "invocations": lambda_fn.metric_invocations(period=Duration.minutes(5), statistic="Sum"),
        },
        period=Duration.minutes(5),
    ),
    threshold=20,
    evaluation_periods=1,
    treat_missing_data=cw.TreatMissingData.NOT_BREACHING,
)
error_rate_alarm.add_alarm_action(cw_actions.SnsAction(self.alarm_topic))
error_rate_alarm.add_ok_action(cw_actions.SnsAction(self.alarm_topic))  # ← don't skip this

# Alarm 2: P99 duration > 24s (80% of 30s timeout)
# Alarming at 80% of timeout gives early warning before
# requests actually start failing.
duration_alarm = cw.Alarm(
    self, "DurationAlarm",
    alarm_name="agentic-sre-high-duration",
    metric=lambda_fn.metric_duration(period=Duration.minutes(5), statistic="p99"),
    threshold=24000,         # milliseconds
    evaluation_periods=2,    # 2 consecutive periods avoids false positives
    treat_missing_data=cw.TreatMissingData.NOT_BREACHING,
)
duration_alarm.add_alarm_action(cw_actions.SnsAction(self.alarm_topic))
duration_alarm.add_ok_action(cw_actions.SnsAction(self.alarm_topic))

# Alarm 3: Any throttling at all
# Throttles = requests being dropped. Zero tolerance.
throttle_alarm = cw.Alarm(
    self, "ThrottleAlarm",
    alarm_name="agentic-sre-throttling",
    metric=lambda_fn.metric_throttles(period=Duration.minutes(5), statistic="Sum"),
    threshold=1,
    evaluation_periods=1,
    treat_missing_data=cw.TreatMissingData.NOT_BREACHING,
)
throttle_alarm.add_alarm_action(cw_actions.SnsAction(self.alarm_topic))
throttle_alarm.add_ok_action(cw_actions.SnsAction(self.alarm_topic))

⚠️ Warning: Don't forget add_ok_action. Without it, the agent never receives the resolution signal and leaves the incident open in Slack forever.
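The rate-versus-count reasoning behind the first alarm can be checked with plain arithmetic. This is a minimal sketch, not code from the repo: `error_rate_percent` and `breaches` are hypothetical helper names, and the comparison assumes the alarm keeps CDK's default operator (greater than or equal to the threshold).

```python
def error_rate_percent(errors: int, invocations: int):
    """Mirror the alarm's math expression: errors / invocations * 100.

    Returns None when there were no invocations, which corresponds to a
    missing datapoint (the alarm treats missing data as not breaching).
    """
    if invocations == 0:
        return None
    return errors / invocations * 100


def breaches(errors: int, invocations: int, threshold: float = 20.0) -> bool:
    """Return True when the computed rate would put the alarm into ALARM."""
    rate = error_rate_percent(errors, invocations)
    # Assumption: default comparison operator, i.e. rate >= threshold.
    return rate is not None and rate >= threshold


print(breaches(5, 5))       # 100% error rate
print(breaches(5, 10_000))  # 0.05% error rate
```

The same five errors alarm in the first case and stay quiet in the second, which is the behaviour a raw error-count threshold cannot give you.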

Part 4 — Stack 3: The Webhook Handler

This Lambda bridges CloudWatch → DevOps Agent. Two production patterns worth highlighting:

Pattern 1: Secrets Manager + Module-level Caching

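# webhook_handler/index.py
import json
import os

import boto3

# Assumption: the secret's name (not its value) is supplied by an environment variable.
SECRET_NAME = os.environ.get("SECRET_NAME", "agentic-sre/devops-agent-webhook")

# Module-level cache that persists across warm invocations:
# one Secrets Manager fetch per cold start, zero on warm calls.
_secret_cache = None


def get_webhook_config():
    global _secret_cache
    if _secret_cache is not None:
        return _secret_cache
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=SECRET_NAME)
    _secret_cache = json.loads(response["SecretString"])
    return _secret_cache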

Never put the webhook URL or HMAC secret in Lambda env vars. They are readable in plaintext by anyone who can view the function configuration, and they are written into the synthesized CloudFormation template.

Pattern 2: HMAC-SHA256 Signing

import base64
import hashlib
import hmac


def sign_payload(payload_str: str, secret: str, timestamp: str) -> str:
    """
    DevOps Agent verifies this signature on every webhook call.
    Format: HMAC-SHA256('{timestamp}:{payload}', secret) → base64
    Timestamp prevents replay attacks.
    """
    message = f"{timestamp}:{payload_str}"
    sig = hmac.new(secret.encode("utf-8"), message.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(sig).decode("utf-8")
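To see why the timestamp matters, it helps to look at the receiving side. The sketch below is an illustration of how a receiver could check a signature in this format; it is not DevOps Agent's actual verification code. `verify_signature`, the 300-second window, and the choice of epoch seconds for the timestamp are all assumptions.

```python
import base64
import hashlib
import hmac
import time


def sign_payload(payload_str: str, secret: str, timestamp: str) -> str:
    """Same signing scheme as above: HMAC-SHA256 over '{timestamp}:{payload}'."""
    message = f"{timestamp}:{payload_str}"
    digest = hmac.new(secret.encode("utf-8"), message.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("utf-8")


def verify_signature(payload_str, secret, timestamp, signature,
                     max_age_seconds=300, now=None) -> bool:
    """Hypothetical receiver-side check: reject stale or tampered requests."""
    now = time.time() if now is None else now
    try:
        sent_at = float(timestamp)  # assumption: timestamp is epoch seconds
    except ValueError:
        return False
    if abs(now - sent_at) > max_age_seconds:
        return False  # too old (or too far in the future): possible replay
    expected = sign_payload(payload_str, secret, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Because the timestamp is part of the signed message, an attacker who captures a request cannot simply change the timestamp to make an old payload look fresh.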

Payload Mapping

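from datetime import datetime, timezone

# SERVICE_NAME, map_alarm_state_to_action and map_alarm_to_priority are defined elsewhere in the module.
def build_incident_payload(alarm_detail: dict, sns_message: dict) -> dict: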
    alarm_name = alarm_detail.get("alarmName", "unknown-alarm")
    state = alarm_detail.get("state", {}).get("value", "ALARM")

    return {
        # Required — DevOps Agent rejects payloads missing these
        "eventType": "incident",
        "incidentId": f"cw-{alarm_name}-{int(datetime.now(timezone.utc).timestamp())}",
        "action": map_alarm_state_to_action(state),  # created | resolved | updated
        "priority": map_alarm_to_priority(alarm_name),  # HIGH / MEDIUM based on alarm name
        "title": f"[{state}] {alarm_name}",
        # Optional — improve investigation quality
        "description": alarm_detail.get("state", {}).get("reason", ""),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": SERVICE_NAME,
    }
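The two mapping helpers used above are not shown in this post. A minimal sketch of what they could look like follows; the specific rules (OK maps to resolved, error-rate and throttling alarms are HIGH) are assumptions chosen to match the inline comments, not the repo's actual logic.

```python
def map_alarm_state_to_action(state: str) -> str:
    """Translate a CloudWatch alarm state into a webhook action."""
    # ALARM opens an incident, OK resolves it, anything else
    # (for example INSUFFICIENT_DATA) is reported as an update.
    return {"ALARM": "created", "OK": "resolved"}.get(state, "updated")


def map_alarm_to_priority(alarm_name: str) -> str:
    """Pick a priority from the alarm name (assumed naming convention)."""
    high_markers = ("error-rate", "throttling")
    if any(marker in alarm_name for marker in high_markers):
        return "HIGH"
    return "MEDIUM"


for name in ("agentic-sre-high-error-rate", "agentic-sre-high-duration"):
    print(name, map_alarm_to_priority(name))
```

Keying priority off the alarm name is simple but brittle; if alarm names change, the mapping silently falls back to MEDIUM.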

The CDK stack wires SNS → Lambda in one line:

alarm_topic.add_subscription(subs.LambdaSubscription(self.webhook_fn))

CDK automatically generates the correct Lambda resource policy for SNS to invoke it.
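When SNS invokes the Lambda, the alarm arrives as a JSON string inside `Records[].Sns.Message`, and CloudWatch names its fields `AlarmName`, `NewStateValue`, and `NewStateReason`. The sketch below shows one way to normalize that into the shape `build_incident_payload` reads (`alarmName`, `state.value`, `state.reason`). `extract_alarm_details` is a hypothetical helper, not a function from the repo.

```python
import json


def extract_alarm_details(event: dict) -> list:
    """Unwrap SNS records and normalize CloudWatch alarm fields."""
    details = []
    for record in event.get("Records", []):
        # SNS delivers the alarm as a JSON-encoded string.
        message = json.loads(record["Sns"]["Message"])
        details.append({
            "alarmName": message.get("AlarmName", "unknown-alarm"),
            "state": {
                "value": message.get("NewStateValue", "ALARM"),
                "reason": message.get("NewStateReason", ""),
            },
        })
    return details
```

If the key names do not line up, the `unknown-alarm` fallback ends up in the incident title, so it is worth logging the raw message once while testing.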

Part 5 — Setting Up the DevOps Agent Space

Step 1 — Open the Console

Search "DevOps Agent" in the AWS Console. Switch to us-east-1 (N. Virginia). If the service isn't visible, your region isn't supported yet.

Step 2 — Create the Agent Space

Click "Create Agent Space +" and fill in:

Name: agentic-sre-space
Description: SRE agent for Lambda incident investigation
IAM role → "Create a new AWS DevOps Agent role using a policy template" (auto-creates everything)
Operator App role → "Auto-create a new role"

Click Create Agent Space.

Step 3 — Open the Operator Console

From the Agent Space details page → "Operator Access" → bookmark the URL. This is your investigation dashboard.

Step 4 — Connect GitHub

💡 Note: CLI docs skip this: GitHub requires OAuth via the console BEFORE any CLI association. You cannot skip this step.

Features tab → CI/CD Pipeline → GitHub → "Create a new registration" → "Install and authenticate"

On GitHub: select "Only select repositories" → choose agentic-sre → Install.

Step 5 — Connect Slack

Features tab → Notifications → Slack → Connect workspace

Select #agent-sre-incidents, then in Slack: /invite @AWS DevOps Agent

Step 6 — Create the Webhook (3-Step Wizard)

Integrations → Webhooks → Create webhook

The wizard shows you the required payload schema. Our handler already matches it:

{
  "eventType": "incident",
  "incidentId": "string",
  "action": "created" | "updated" | "closed" | "resolved",
  "priority": "CRITICAL" | "HIGH" | "MEDIUM" | "LOW" | "MINIMAL",
  "title": "string",
  "description?": "string",
  "timestamp?": "string",
  "service?": "string"
}
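Since payloads missing required fields are rejected, a local check before sending can save a debugging round trip. This is a sketch under the assumption that the fields without a `?` in the schema above are the required ones; `validate_incident_payload` is a hypothetical helper.

```python
ALLOWED_ACTIONS = {"created", "updated", "closed", "resolved"}
ALLOWED_PRIORITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW", "MINIMAL"}
REQUIRED_FIELDS = ("eventType", "incidentId", "action", "priority", "title")


def validate_incident_payload(payload: dict) -> list:
    """Return a list of human-readable problems; empty means the payload looks valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not payload.get(field):
            problems.append(f"missing required field: {field}")
    if payload.get("eventType") and payload["eventType"] != "incident":
        problems.append("eventType must be 'incident'")
    if payload.get("action") and payload["action"] not in ALLOWED_ACTIONS:
        problems.append(f"unknown action: {payload['action']}")
    if payload.get("priority") and payload["priority"] not in ALLOWED_PRIORITIES:
        problems.append(f"unknown priority: {payload['priority']}")
    return problems
```

Returning a list rather than raising on the first error makes it easy to log every problem in a single invocation.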

On Step 3: click Generate, copy both the URL and secret immediately (secret shown only once), then:

aws secretsmanager create-secret \
  --name agentic-sre/devops-agent-webhook \
  --secret-string '{"url":"YOUR_WEBHOOK_URL","secret":"YOUR_SECRET"}' \
  --region us-east-1

Part 6 — The Custom Skill (The Most Underdocumented Feature)

A Skill is a Markdown file containing your SRE runbook. The agent reads it at the start of every investigation and follows your documented sequence instead of reasoning from scratch.

The Frontmatter Requirement Nobody Mentions

The console says: "Upload a zip file containing your skill. The zip must include a SKILL.md file with name and description in the frontmatter."

What it doesn't show you is the exact format. After my first upload failed silently, I found it:

---
name: Lambda Incident Investigation Runbook
description: Use this skill when investigating AWS Lambda function errors,
  timeouts, high duration, or throttling incidents. Guides the agent through
  a 5-step investigation sequence covering alarm confirmation, metric analysis,
  log inspection, deployment correlation, and root cause identification.
---

# Lambda Incident Investigation Runbook

## Investigation Sequence
...

The zip structure must be exactly:

lambda-investigation-skill.zip
└── SKILL.md    ← root level, frontmatter required
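Building the zip by hand is where the root-level requirement tends to go wrong, for example when zipping a folder instead of the file. Here is a small sketch that checks the frontmatter and writes `SKILL.md` at the archive root. `build_skill_zip` is a hypothetical helper, and the frontmatter check is deliberately simple (it only looks for top-level `name` and `description` keys).

```python
import io
import zipfile


def build_skill_zip(skill_markdown: str) -> bytes:
    """Validate the frontmatter and return zip bytes with SKILL.md at the root."""
    lines = skill_markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter line")
    closing = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if closing is None:
        raise ValueError("frontmatter is not closed with '---'")
    # Top-level keys only; indented lines are continuations of a previous value.
    keys = {
        line.split(":", 1)[0].strip()
        for line in lines[1:closing]
        if ":" in line and not line[:1].isspace()
    }
    missing = {"name", "description"} - keys
    if missing:
        raise ValueError(f"frontmatter is missing: {sorted(missing)}")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("SKILL.md", skill_markdown)  # no directory prefix
    return buffer.getvalue()
```

Writing the returned bytes to `lambda-investigation-skill.zip` gives an archive with exactly one root-level entry.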

What the Default Skill vs Your Custom Skill Looks Like

In the Operator Console I saw:

Read skill: /aidevops/skills/user/understanding-agent-space/SKILL.md

💡 Note: Clarification: understanding-agent-space is a default skill AWS pre-loads into every Agent Space. It ran here because I triggered the alarm manually with no real incident context. When you trigger via burst mode with real Lambda errors, the agent selects your custom skill instead.

Skill path structure:

/aidevops/skills/user/
  understanding-agent-space/    ← AWS default (pre-loaded)
  lambda-incident-investigation-runbook/  ← your custom skill

My 5-Step Runbook

## Investigation Sequence

### Step 1 — Confirm the Alarm
- Which alarm triggered: error rate, duration, or throttling?
- Check state history in CloudWatch for the last 30 minutes
- Note exact time alarm entered ALARM state

### Step 2 — Check Lambda Metrics (Last 30 Minutes)
Pull for function `agentic-sre-demo-app`:
- Errors, Invocations, Duration (p50/p95/p99), Throttles, ConcurrentExecutions

### Step 3 — Check Recent Logs
Query /aws/lambda/agentic-sre-demo-app:
- ERROR or CRITICAL level lines
- "Task timed out after" messages
- Time window: alarm trigger ± 10 minutes

### Step 4 — Correlate With Recent Deployments
Check GitHub for deployments in last 2 hours:
- Was a new version deployed before the incident?
- Does deployment timestamp correlate with metric anomaly?

### Step 5 — Identify Root Cause Category

| Symptom | Likely Root Cause |
|---|---|
| High error rate + exception in logs | Code bug or bad deploy |
| High duration + no errors | Slow downstream (DB, external API) |
| High duration + timeout logs | Lambda timeout too low |
| Throttling | Concurrency limit too low |

Part 7 — Deploy Everything

git clone https://github.com/MohammedAnes/agentic-sre.git
cd agentic-sre

python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt

cdk bootstrap                # first time only
cdk deploy --all             # deploys DemoApp → Alarms → Webhook

Deploy takes 3-4 minutes. Expected outputs:

AgenticSreDemoApp.LambdaFunctionName = agentic-sre-demo-app
AgenticSreAlarms.AlarmTopicArn = arn:aws:sns:us-east-1:...
AgenticSreWebhook.WebhookFunctionName = agentic-sre-webhook-handler

Part 8 — Triggering Investigations

# Force error rate alarm immediately
python scripts/trigger.py error

# Force duration alarm
python scripts/trigger.py slow

# Force throttle alarm
python scripts/trigger.py throttle

# Trigger NATURALLY — 15 real invocations, waits for CloudWatch
# Produces much better investigation quality
python scripts/trigger.py burst 15

# Reset everything to OK
python scripts/trigger.py ok

⚠️ Warning: Critical lesson: trigger.py error calls set-alarm-state — forces the alarm but creates zero real logs or metrics. The agent reports "No information about what triggered this investigation." Use burst 15 instead. Takes 5 minutes to trip naturally but the investigation quality is dramatically better.
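The script itself is not reproduced in this post, so here is a rough sketch of what a burst mode could look like: invoke the demo function repeatedly with a failing payload so CloudWatch records real errors and logs. `burst` is a hypothetical function; the client is passed in so the logic can be exercised without AWS credentials.

```python
import json


def burst(lambda_client, function_name: str, count: int, simulate: str = "error") -> dict:
    """Invoke the function `count` times and tally how many invocations failed."""
    summary = {"sent": 0, "function_errors": 0}
    payload = json.dumps({"simulate": simulate}).encode("utf-8")
    for _ in range(count):
        response = lambda_client.invoke(FunctionName=function_name, Payload=payload)
        summary["sent"] += 1
        # Lambda sets FunctionError when the handler raised or timed out.
        if response.get("FunctionError"):
            summary["function_errors"] += 1
    return summary
```

With real credentials this would be called as `burst(boto3.client("lambda"), "agentic-sre-demo-app", 15)`. The alarm still needs a full evaluation period to trip, which matches the roughly five-minute wait mentioned above.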

Part 9 — What Actually Happened

In Slack (#agent-sre-incidents)

The agent posted real-time investigation updates. The standout behavior: automatic alarm grouping. When two follow-up alarms fired 59 seconds and 2 minutes after the primary, the agent recognized them as related and linked them instead of opening duplicate investigations:

"Investigation linked: [ALARM] unknown-alarm — Same alarm type, same priority (HIGH), triggered 59 seconds after the primary. This is a continuation of the same ongoing issue affecting the system."

This deduplication is built-in — no configuration needed.

In the Operator Console

The investigation timeline showed every step:

Read the default understanding-agent-space skill (pre-loaded by AWS)
Made one datetime call to establish time context
Topology discovery across the account
Metric and log correlation
Root cause + mitigation plan generated

The Topology tab mapped resource relationships — Lambda, API Gateway, CloudFormation, GitHub, X-Ray, Bedrock Foundation Models — before investigation started.

The Investigation Gaps (The Honest Part)

The agent surfaced two gaps worth sharing:

Gap 1:

"Cannot directly query Bedrock AgentCore API to verify runtime operational status — GetAgentRuntime, ListAgentRuntimes, InvokeRuntime, GetGateway APIs are not available in the current toolset."

AgentCore runtime APIs aren't in the DevOps Agent toolset yet. It infers from CloudWatch instead.

Gap 2:

"No information about what triggered this investigation — the investigation was created with N/A for both description and starting point."

This was my fault — I used set-alarm-state. No real invocations, no logs, nothing to investigate. Use burst mode.

✅ Tip: These gaps make the post more credible, not less. The agent being honest about its limits is a feature, not a bug.

Part 10 — Cost Breakdown

Service	What runs	Pricing basis	Monthly cost
Lambda — demo app	~100 test invocations/month	1M req free/month	$0
Lambda — webhook handler	Fires only on alarms	1M req free/month	$0
CloudWatch Alarms	3 standard alarms	$0.10/alarm/month	$0.30
CloudWatch Metrics	AWS-vended Lambda metrics	Always free	$0
CloudWatch Logs	~5 MB/month	$0.50/GB ingested	< $0.01
SNS	< 1,000 notifications/month	1M free	$0
Secrets Manager	1 secret	$0.40/secret/month	$0.40
CDK Bootstrap S3	Staging bucket	S3 standard	~$0.01
Infra subtotal			~$0.71/month
AWS DevOps Agent	Autonomous investigations	2-month free trial	$0 (trial)

After the trial: credits based on AWS Support plan offset the cost — up to 100% for Unified Operations, 75% for Enterprise On-Ramp, 30% for Business+.

You can build and blog about this for under $1/month.
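The infra subtotal can be reproduced from the paid line items in the table. This is just the arithmetic, using the prices quoted above and treating 5 MB as 5/1024 GB (an assumption about rounding).

```python
# Monthly line items in USD, taken from the cost table above.
line_items = {
    "cloudwatch_alarms": 3 * 0.10,         # 3 standard alarms
    "cloudwatch_logs": (5 / 1024) * 0.50,  # ~5 MB ingested at $0.50/GB
    "secrets_manager": 1 * 0.40,           # 1 secret
    "cdk_bootstrap_s3": 0.01,              # staging bucket estimate
}

subtotal = sum(line_items.values())
print(f"Infra subtotal: ~${subtotal:.2f}/month")  # Infra subtotal: ~$0.71/month
```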

Key Learnings

1. Skills are your biggest leverage point — investigation quality scales directly with runbook quality. Write a good SKILL.md before running anything.

2. Use rate-based alarms. A MathExpression error rate carries far more signal than a raw error count.

3. Secrets Manager + module-level caching. It costs a few extra lines compared with env vars and is a significant security improvement.

4. Always add_ok_action — without it the agent never resolves incidents in Slack.

5. set-alarm-state = empty investigation — use burst mode for realistic demos.

6. Alarm grouping is built-in — duplicate investigations from repeated alarms are handled automatically.

7. The official CDK sample is TypeScript only — if you're a Python CDK developer, that's the gap this repo fills.

What I Am Building Next

A human approval gate before any remediation executes — Step Functions pauses the agent, sends a Slack approve/reject message, and only continues after human confirmation. That's the MCP Governance pattern applied to DevOps Agent.

Also exploring AgentCore Payments — the agent autonomously paying for premium monitoring data during investigations. Follow-up post coming.

Resources

AWS DevOps Agent — product page
Official Agentic SRE blog post (3-account + Splunk)
AWS DevOps Agent CDK getting started guide
DevOps Agent Skills documentation
aws-samples/sample-aws-devops-agent-cdk (TypeScript — the repo this post complements)

I am Mohammed Anes, an AWS Community Builder in the AI/ML category. Building on AWS DevOps Agent, AgentCore, or Bedrock? Let's connect.

Amazon Nova Act Deep Dive — Perceive, Act, Deploy: How AWS Built a 90%+ Reliable Browser Agent

Mohammed Anes — Sun, 22 Mar 2026 06:59:28 +0000

Raise your hand if this has happened to you:

You write a Selenium script. It works on Friday. On Monday, the site changed a button class, and it's broken.

You switch to Playwright. Better. But the moment a cookie banner pops up at the wrong time, your agent halts, completely lost.

This is the core problem with browser automation: it's rule-based. You're telling it exactly what to click — not what you want to accomplish.

AI agents were supposed to fix this. But the first generation of LLM-powered browser bots had a different problem: give a general LLM one big instruction like "book me the cheapest flight to Delhi", and it would hallucinate steps, lose context midway, or confidently click the wrong thing with zero awareness of failure.

Benchmarks showed state-of-the-art models hitting only 30–60% accuracy on real browser tasks.

Amazon Nova Act was built specifically to close this gap — and it reports over 90% reliability at scale.

Here's the full architecture breakdown.

What Is Amazon Nova Act?

Nova Act is an AWS service for building and managing fleets of reliable AI agents that automate browser-based UI workflows. Introduced by Amazon AGI Labs in early 2025 as a research preview, it moved to general availability with full AWS integrations: IAM, S3, CloudWatch, and Bedrock AgentCore.

The one-line summary: Nova Act is what you get when you train a model specifically and exclusively for browser automation — not bolt an LLM onto existing tools.

What makes it different:

Feature	Traditional Automation	Nova Act
Element targeting	CSS selectors / XPath	Natural language + vision
Breaks on UI change?	Yes, constantly	Rarely — it sees the page
Reliability	30–60% (complex tasks)	90%+
Infrastructure	You manage it	AgentCore handles it
Human escalation	Manual alert setup	Built-in, first-class

The key insight behind the reliability number: vertical integration. Most agentic frameworks take a general model and attach browser tools. Nova Act co-trained the model, orchestrator, and browser actuator together end-to-end. That's what moves the needle from 50% to 90%.

The Core Loop: Perceive → Reason → Act

Every Nova Act workflow runs on one repeating cycle:

📸 Screenshot  →  🧠 Nova 2 Lite Model  →  ⚡ Action  →  📸 Screenshot again

Step 1 — Perceive: A screenshot of the current browser state is taken and passed to the model along with your natural language instruction. No DOM parsing. No HTML inspection. It sees the page visually, the same way a human does.

Step 2 — Reason: Amazon Nova 2 Lite (the custom foundation model powering Nova Act) produces a low-level action plan: click at (x, y), type "Chennai", scroll down 400px, press Enter. Trained with reinforcement learning on in-domain browser data — it predicts the next correct browser action, not just the next token.

Step 3 — Act: The action is executed via Playwright under the hood. Nova Act sits on top of Playwright, translating natural language to precise Playwright commands automatically.

Then a new screenshot is taken and the loop repeats — until the task is done, or until the agent decides it needs a human.
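The shape of that cycle can be written down in a few lines. To be clear, this is a conceptual sketch and not Nova Act's internals: `run_loop` and the three callables are hypothetical stand-ins for the screenshot capture, the model, and the browser actuator.

```python
def run_loop(take_screenshot, choose_action, execute, instruction, max_steps=20):
    """Generic perceive -> reason -> act loop with a step budget."""
    history = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                            # perceive
        action = choose_action(screenshot, instruction, history)  # reason
        if action["type"] in ("done", "needs_human"):
            return action["type"], history
        execute(action)                                           # act
        history.append(action)
    return "max_steps_reached", history
```

The step budget matters: without it, an agent that never reaches a "done" decision would loop forever.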

The Most Important Pattern: Atomic `act()` Calls

This is the single design decision that separates Nova Act from everything else. And it's deceptively simple:

Don't give one big instruction. Give many small, precise ones.

❌ This approach has ~50% success:

nova.act("book me the cheapest flight from Chennai to Delhi next Friday")

✅ This approach gets 90%+:

nova.act("go to makemytrip.com")
nova.act("click on 'Flights'")
nova.act("set origin to Chennai")
nova.act("set destination to Delhi")
nova.act("set date to next Friday")
nova.act("click Search")
nova.act("sort results by price")
nova.act("click the cheapest result")

Each act() call:

Takes a fresh screenshot of the current state
Executes exactly one clearly scoped action
Returns a result object you can inspect, assert, and branch on in Python

Because you're writing Python around these calls, you get conditionals, loops, retries, and error handling for free. It's not a black box — it's a library.

from nova_act import NovaAct

with NovaAct(starting_page="https://www.amazon.com") as nova:
    nova.act("search for a noise cancelling headphone under 5000 rupees")
    nova.act("select the first result")

    result = nova.act(
        "what is the price shown on this page?",
        schema={"type": "object", "properties": {"price": {"type": "string"}}}
    )
    print(result.parsed_response)  # {"price": "₹3,499"}

    if result.parsed_response["price"]:
        nova.act("click Add to Cart")

Notice the schema parameter — that's how you extract structured data from any page. No CSS selectors. No XPath. Just natural language + a JSON schema, and Nova Act fills it in from what it sees on screen.
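The "retries for free" point is easy to make concrete, because each `act()` call is an ordinary Python call. The wrapper below is a sketch: `act_with_retry` is a hypothetical helper, it only assumes the object has an `act(instruction)` method, and which exceptions are worth retrying is left to the caller.

```python
import time


def act_with_retry(nova, instruction, attempts=3, delay_seconds=1.0, retry_on=(Exception,)):
    """Call nova.act(instruction), retrying a fixed number of times on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return nova.act(instruction)
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # brief pause before the next try
    raise last_error
```

Retrying a single small step is cheap; retrying one giant instruction means redoing everything, which is another argument for keeping `act()` calls atomic.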

Structured Outputs: Turn Any Website Into an API

This feature deserves its own callout because it's genuinely powerful.

result = nova.act(
    "what are the top 3 products on this page?",
    schema={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name":   {"type": "string"},
                        "price":  {"type": "string"},
                        "rating": {"type": "string"}
                    }
                }
            }
        }
    }
)

products = result.parsed_response["products"]
# [{"name": "...", "price": "...", "rating": "..."}, ...]

Any website — even one with zero public API — becomes a structured data source. Combine with ThreadPoolExecutor and you're running 10 concurrent extractions at once.
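Because the schema above returns prices as strings, some post-processing is usually needed before comparing them. A minimal sketch, assuming prices look like "₹3,499" or "$12.50"; `parse_price` and `cheapest` are hypothetical helpers, not part of the Nova Act SDK.

```python
import re
from decimal import Decimal


def parse_price(text: str):
    """Pull the first number out of a price string, ignoring currency symbols."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


def cheapest(products: list):
    """Return the product dict with the lowest parseable price, or None."""
    priced = [(parse_price(p.get("price", "")), p) for p in products]
    priced = [(price, p) for price, p in priced if price is not None]
    if not priced:
        return None
    return min(priced, key=lambda pair: pair[0])[1]
```

Using `Decimal` rather than `float` avoids rounding surprises when the extracted values later feed into totals.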

Running Workflows in Parallel

Nova Act is designed for fleet-scale. Here's the concurrency pattern:

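from concurrent.futures import ThreadPoolExecutor, as_completed
from nova_act import NovaAct, ActError

# Placeholders: the post does not define these two names, so the versions below are stand-ins.
invoice_schema = {"type": "object", "properties": {"invoices": {"type": "array"}}}

def save_to_s3(data):
    print(f"would upload: {data}")  # swap in a real S3 upload
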
def process_vendor(vendor_url: str):
    with NovaAct(starting_page=vendor_url) as nova:
        nova.act("log in using saved credentials")
        nova.act("navigate to pending invoices")
        return nova.act(
            "extract all invoices",
            schema=invoice_schema
        ).parsed_response

vendor_urls = [
    "https://vendor1.com",
    "https://vendor2.com",
    "https://vendor3.com",
]

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(process_vendor, url): url for url in vendor_urls}
    for future in as_completed(futures):
        try:
            data = future.result()
            save_to_s3(data)
        except ActError as e:
            print(f"Failed: {futures[future]} → {e}")

Ten browser sessions, running in parallel, each isolated, each logged — all managed through AWS.

The Full AWS Stack

Nova Act doesn't run in a silo. Here's how it slots into AWS:

┌─────────────────────────────────────────────────┐
│              Developer Tools                    │
│  Web Playground · IDE Extension · Python SDK   │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│           Nova Act Engine                       │
│  Amazon Nova 2 Lite · Orchestrator · Playwright │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│       Amazon Bedrock AgentCore                  │
│  Runtime · Browser Tool · Session Isolation     │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│            AWS Infrastructure                   │
│    ECR · IAM · S3 · CloudWatch · CloudTrail     │
└─────────────────────────────────────────────────┘

AgentCore Browser Tool

Provides a fully managed, cloud-based browser runtime:

Session isolation — every workflow gets its own containerized browser
Parallel execution at scale — no infrastructure to manage
Live viewing + session replay — debug in real time
CloudTrail logging — full audit trail for every browser action
Ephemeral containers — browser environment destroyed after each session (security by default)

Connecting to AgentCore Browser from code:

from bedrock_agentcore.tools.browser_client import browser_session
from nova_act import NovaAct

def run_cloud_agent(prompt: str, starting_page: str, nova_act_key: str):
    # Start a managed browser session in AgentCore; it is torn down on exit.
    with browser_session(region="us-east-1") as client:
        # WebSocket URL and signed headers for the remote browser's CDP endpoint.
        ws_url, headers = client.generate_ws_headers()
        # Point Nova Act at the remote browser instead of launching a local one.
        with NovaAct(
            cdp_endpoint_url=ws_url,
            cdp_headers=headers,
            nova_act_api_key=nova_act_key,
            starting_page=starting_page,
        ) as nova:
            return nova.act(prompt)

Your agent now runs in a sandboxed cloud browser — not on your laptop.

Deployment in one command

The VS Code extension handles the entire deployment pipeline automatically:

Packages your workflow → Docker container
Pushes image → Amazon ECR
Creates → IAM roles + S3 buckets
Deploys → AgentCore Runtime

Zero manual infrastructure. One click in the IDE.

Cost Effectiveness: The Real Story

Traditional browser automation has hidden costs that never show up in your AWS bill:

Developer time maintaining brittle selectors (breaks every sprint)
Failed workflows causing downstream data errors
Re-running failed jobs, debugging silent failures

Nova Act's natural language instructions are resilient to UI changes. "Click the submit button" works whether it's labelled Submit, Send, or Confirm — blue or orange — left side or right side.

On infrastructure costs, AgentCore's pricing model is genuinely aligned with how agentic workloads behave:

✅ Consumption-based: pay per second of actual CPU/memory use
✅ No CPU charge during I/O wait: agentic workloads spend 30–70% of their time waiting for pages to load or for LLM responses, and CPU is not billed while the agent is idle.
✅ No reserved instances: no upfront commitment
✅ Free Tier: up to $200 credit for new AWS customers

Practical example: for a workflow that spends 60% of its time waiting (page loads, API calls), CPU is billed for only about 40% of wall-clock time. Memory is still billed while the session is alive, so check the pricing page for the exact split. At scale, the savings compound.
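
A rough sketch of that arithmetic, under stated assumptions: CPU is billed only for active seconds, memory is billed for the whole session, and both rates below are made-up placeholders rather than real AgentCore prices. `estimate_session_cost` is a hypothetical helper; confirm the billing rules and rates against the AgentCore pricing page before relying on it.

```python
def estimate_session_cost(wall_seconds, io_wait_fraction,
                          cpu_rate_per_second, memory_rate_per_second):
    """Estimate the cost of one session under the assumptions in the text."""
    if not 0 <= io_wait_fraction <= 1:
        raise ValueError("io_wait_fraction must be between 0 and 1")
    # Assumption: CPU is billed only while the agent is actively computing.
    active_seconds = wall_seconds * (1 - io_wait_fraction)
    cpu_cost = active_seconds * cpu_rate_per_second
    # Assumption: memory is billed for the full lifetime of the session.
    memory_cost = wall_seconds * memory_rate_per_second
    return {
        "active_seconds": active_seconds,
        "cpu_cost": cpu_cost,
        "memory_cost": memory_cost,
        "total": cpu_cost + memory_cost,
    }


if __name__ == "__main__":
    # Placeholder rates, chosen only to make the arithmetic easy to follow.
    estimate = estimate_session_cost(
        wall_seconds=100,
        io_wait_fraction=0.6,
        cpu_rate_per_second=0.01,
        memory_rate_per_second=0.001,
    )
    print(estimate)
```

With a 100-second session and 60% I/O wait, the CPU term is computed on roughly 40 seconds while the memory term uses all 100.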

Real-World Use Cases

🧪 QA and Automated Testing

Your test cases can now read like user stories:

with NovaAct(starting_page="https://yourapp.com") as nova:
    nova.act("log in as a standard user")
    nova.act("add one item to cart and go to checkout")
    nova.act("complete checkout using test card 4111111111111111")

    result = nova.act(
        "is an order confirmation visible?",
        schema={
            "type": "object",
            "properties": {
                "confirmed": {"type": "boolean"},
                "order_id":  {"type": "string"}
            }
        }
    )
    assert result.parsed_response["confirmed"] is True

No selector maintenance. No brittle IDs. Tyler Technologies (public sector software) converted manual test plans to automated suites in minutes using Nova Act — without a single CSS selector.
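
One gap in the test above: the schema does not mark any field as required, so a response with a missing `confirmed` key fails with a bare `KeyError`. Below is a small sketch of a stricter schema plus a checker that produces readable failures. `CONFIRMATION_SCHEMA` and `assert_order_confirmed` are hypothetical names, and whether Nova Act enforces `required` is an assumption worth verifying against the SDK docs.

```python
CONFIRMATION_SCHEMA = {
    "type": "object",
    "properties": {
        "confirmed": {"type": "boolean"},
        "order_id": {"type": "string"},
    },
    # Standard JSON Schema keyword; assumed (not verified) to be honored.
    "required": ["confirmed"],
}


def assert_order_confirmed(parsed):
    """Return the order ID, or raise AssertionError with a clear message."""
    if not isinstance(parsed, dict):
        raise AssertionError(f"expected an object, got {type(parsed).__name__}")
    if parsed.get("confirmed") is not True:
        raise AssertionError(f"order not confirmed: {parsed!r}")
    order_id = parsed.get("order_id")
    if not order_id:
        raise AssertionError("order confirmed but no order_id was extracted")
    return order_id


if __name__ == "__main__":
    # In the real test this dict would be result.parsed_response.
    print(assert_order_confirmed({"confirmed": True, "order_id": "A-1001"}))
```

In the test, `assert_order_confirmed(result.parsed_response)` would replace the final `assert` line.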

📋 ERP and Legacy System Automation

Many enterprise systems — legacy CRMs, ERP portals, government platforms — have no API. Nova Act handles them:

for record in crm_records:
    with NovaAct(starting_page="https://legacy-erp.company.com") as nova:
        nova.act("click New Contact")
        nova.act(f"enter '{record['name']}' in the Name field")
        nova.act(f"enter '{record['email']}' in the Email field")
        nova.act(f"select '{record['region']}' from the Region dropdown")
        nova.act("click Save")
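
One thing to watch in the loop above: values are spliced into single-quoted prompts, so a name like O'Brien produces an ambiguous instruction. Here is a minimal sketch of building the instructions up front with safer quoting. `contact_steps` is a hypothetical helper, and how well the model copes with escaped quotes is an assumption you should test on your own portal.

```python
import json


def contact_steps(record):
    """Build the act() instructions for one CRM record before opening a browser."""

    def quoted(value):
        # json.dumps wraps the value in double quotes and escapes embedded ones.
        return json.dumps(str(value), ensure_ascii=False)

    # A missing field raises KeyError here, before any session is started.
    return [
        "click New Contact",
        f"enter {quoted(record['name'])} in the Name field",
        f"enter {quoted(record['email'])} in the Email field",
        f"select {quoted(record['region'])} from the Region dropdown",
        "click Save",
    ]


if __name__ == "__main__":
    sample = {"name": "Dana O'Brien", "email": "dana@example.com", "region": "EMEA"}
    for step in contact_steps(sample):
        print(step)
```

Inside the session, the loop body then reduces to `for step in contact_steps(record): nova.act(step)`.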

🔍 Competitive Intelligence

Monitor competitor pricing or product listings without an API:

with NovaAct(starting_page="https://competitor.com/pricing") as nova:
    result = nova.act(
        "extract all pricing plans and their features",
        schema=pricing_schema
    )
    save_to_s3(result.parsed_response)
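
`pricing_schema` is not shown in the snippet above. One plausible shape, plus a small diff helper for the monitoring part, might look like this. Both the schema fields and `price_changes` are illustrative assumptions, not taken from the Nova Act docs.

```python
# Hypothetical schema: a list of plans, each with a name, price, and features.
pricing_schema = {
    "type": "object",
    "properties": {
        "plans": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "string"},
                    "features": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["name", "price"],
            },
        }
    },
    "required": ["plans"],
}


def price_changes(previous, current):
    """Compare two extracted snapshots; return {plan: (old_price, new_price)}."""
    old = {plan["name"]: plan["price"] for plan in previous.get("plans", [])}
    new = {plan["name"]: plan["price"] for plan in current.get("plans", [])}
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        # None marks a plan that was added or removed between snapshots.
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes


if __name__ == "__main__":
    yesterday = {"plans": [{"name": "Pro", "price": "$20"}]}
    today = {"plans": [{"name": "Pro", "price": "$25"}]}
    print(price_changes(yesterday, today))
```

Prices are kept as strings because pricing pages often show values such as "Contact us" that do not parse as numbers.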

🏢 Vendor Portal Processing

Dozens of vendor portals, no API integration, invoice processing and status checks — run them all concurrently with automatic human escalation on exceptions.

Human-in-the-Loop: Built for Production Reality

Production agentic systems will hit edge cases. CAPTCHAs. Unexpected modals. Two-factor prompts. Nova Act handles this without you building custom alerting:

1. Workflow hits an ambiguous state
2. Nova Act pauses and sends an SNS notification to a designated supervisor
3. Supervisor receives a devtools URL and can inspect and interact with the live browser state
4. Supervisor takes corrective action
5. Workflow resumes automatically

This isn't bolted on — it's a core architectural feature. The design acknowledges that 90% reliability means 10% still needs a human. That 10% is handled gracefully.
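
On the supervisor side, the notification has to land somewhere. Below is a minimal sketch of a Lambda-style handler that unpacks a standard SNS event so it can be forwarded to chat or a ticket queue. The `Records[].Sns` envelope is the normal SNS-to-Lambda shape, but the message body fields used in the example (`workflow`, `devtools_url`) are hypothetical, since the actual payload format is not described here.

```python
import json


def extract_escalations(event):
    """Pull subject and body out of each SNS record in a Lambda event."""
    escalations = []
    for record in event.get("Records", []):
        sns = record.get("Sns", {})
        raw = sns.get("Message", "")
        try:
            body = json.loads(raw)
        except (TypeError, ValueError):
            body = None
        # Fall back to plain text when the message is not a JSON object.
        if not isinstance(body, dict):
            body = {"text": raw}
        escalations.append({"subject": sns.get("Subject"), "body": body})
    return escalations


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {
                "Sns": {
                    "Subject": "Workflow needs attention",
                    # Hypothetical payload fields, for illustration only.
                    "Message": json.dumps(
                        {"workflow": "vendor-invoices", "devtools_url": "https://example.com/session"}
                    ),
                }
            }
        ]
    }
    for item in extract_escalations(sample_event):
        print(item["subject"], item["body"])
```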

The Developer Journey: Playground → Code → Cloud

nova.amazon.com/act     →    pip install nova-act    →    VS Code Extension    →    Deploy
   Explore in browser         Write Python workflow       Debug step-by-step      One click to AWS

Quick start:

pip install nova-act
export NOVA_ACT_API_KEY="your_api_key"

from nova_act import NovaAct

with NovaAct(starting_page="https://news.ycombinator.com") as nova:
    result = nova.act(
        "what are the top 5 post titles on this page?",
        schema={
            "type": "object",
            "properties": {
                "posts": {"type": "array", "items": {"type": "string"}}
            }
        }
    )
    print(result.parsed_response)

That's all it takes to extract structured data from any webpage.

For Builders in Restricted Regions

Nova Act is currently only available in US East (N. Virginia). If you're in India, Southeast Asia, or any other region without access — here's what to do right now:

1. Learn the atomic act() pattern today. This architecture pattern is what matters. When access opens up, your mental model is already in place.

2. Approximate it with Bedrock + Playwright. Use Claude models on Amazon Bedrock with tool use, plus Playwright for browser control. The perceive→reason→act loop is buildable today. You lose the purpose-built model advantage, but you learn the architecture.

3. Read the public GitHub. The aws/nova-act repo is open. The amazon-agi-labs/nova-act-samples repo has CDK examples for Lambda, ECS, and AgentCore. Reading these is free and region-independent.
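
To make option 2 concrete, here is a skeleton of the perceive→reason→act loop with the three stages injected as plain callables. In a real build, `perceive` would wrap something like a Playwright screenshot, `reason` a Bedrock model call with tool use, and `act` Playwright mouse and keyboard calls. Everything below (`run_agent_loop`, the action dict shape, the stub page) is a hypothetical sketch, not Nova Act's internals.

```python
def run_agent_loop(perceive, reason, act, goal, max_steps=10):
    """Drive a perceive -> reason -> act loop until the model reports 'done'."""
    history = []
    for _ in range(max_steps):
        observation = perceive()                      # e.g. a screenshot
        action = reason(goal, observation, history)   # e.g. an LLM tool call
        history.append(action)
        if action.get("type") == "done":
            return history
        act(action)                                   # e.g. a click or keypress
    raise RuntimeError(f"goal not reached within {max_steps} steps")


def demo():
    # Stub "browser": a single page whose button is either clicked or not.
    page = {"clicked": False}

    def perceive():
        return dict(page)

    def reason(goal, observation, history):
        if observation["clicked"]:
            return {"type": "done"}
        return {"type": "click", "target": "Submit"}

    def act(action):
        if action["type"] == "click":
            page["clicked"] = True

    return run_agent_loop(perceive, reason, act, goal="submit the form")


if __name__ == "__main__":
    for step in demo():
        print(step)
```

The `max_steps` cap matters more than it looks: without it, a model that never emits "done" keeps clicking (and billing) indefinitely.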

What's Coming

Amazon has been explicit about the roadmap:

MCP integration — Nova Act workflows calling external tools and APIs mid-session via Model Context Protocol
Strands Agents — Amazon's open-source multi-agent SDK already supports Nova Act as a sub-agent in larger pipelines
Beyond the browser — the same RL training methodology is being extended to more complex real-world task environments
More regions — no official dates, but global expansion is the clear direction

Browser automation is the beachhead. The broader target is any workflow that requires perception, reasoning, and action in a UI environment.

Key Takeaways

Six things to remember from this article:

Vertical integration is why it's reliable — model + orchestrator + actuator trained together

Atomic act() calls are the pattern — small precise steps beat one big instruction

Vision > DOM selectors — screenshots don't break when UI classes change

AgentCore pricing: you don't pay for CPU during the 30–70% of time your agent spends waiting

Human-in-the-loop is first-class — production agents hit edge cases; it's designed for that

One-command deployment — Docker, ECR, IAM, AgentCore Runtime handled automatically

Resources

🏠 Product page → aws.amazon.com/nova/act
📖 Official docs → docs.aws.amazon.com/nova-act/latest/userguide
💻 GitHub SDK → github.com/aws/nova-act
🧪 Sample code + CDK → github.com/amazon-agi-labs/nova-act-samples
💰 AgentCore pricing → aws.amazon.com/bedrock/agentcore/pricing
🎮 Web playground → nova.amazon.com/act

Found this useful? Drop a ❤️, share with your AWS friends, and follow for more deep dives on agentic AI on AWS. I'll be posting the "Build it yourself with Bedrock + Playwright" version next — a practical guide for restricted-region builders who want to implement the same perceive→act loop today.

The Most Important Pattern: Atomic `act()` Calls