DEV Community: Nitesh Reddy Challa

How I Deployed Hermes Agent on AWS

Nitesh Reddy Challa — Wed, 24 Jun 2026 22:46:01 +0000

My EC2 instance has a public IP address. It has zero inbound firewall rules. And yet I can reach my AI agent from my phone on Telegram, pull up a full web workspace in my browser, and run shell commands on it — all without opening a single port, without a VPN, and without SSH.

The latest version also splits storage deliberately: persistent agent data stays on EFS, while the Hermes install and Python venv live on the root EBS volume. That keeps pip install and hermes update I/O off EFS and holds always-on infrastructure at a predictable ~$35/mo.

That's the setup this post is about.

What is Hermes Agent?

Hermes Agent is an open-source AI agent from Nous Research. It's not a chatbot wrapper. It has persistent memory, skills, a file system, a sandboxed terminal backend, and a full web workspace UI. You point it at a model provider and it runs as a daemon — hermes-gateway — serving an OpenAI-compatible API.

The web workspace looks like a proper IDE: chat panel, file browser, terminal, job queue. The Telegram integration is a long-polling bot that connects to the same gateway — no extra server, no webhook, no public URL.

I wanted this running on AWS, backed by Amazon Bedrock (no API keys to rotate, IAM role handles auth), with my agent's memory surviving instance replacements.

Architecture

Your phone (Telegram)
  └─► Telegram servers ──► hermes-gateway long-poll (outbound HTTPS only)

Your laptop (browser)
  └─► aws ssm start-session ──► SSM port-forward :3000
                                   └─► hermes-workspace (loopback only)

EC2 m7g.medium · public subnet · ZERO inbound SG · dynamic public IP
  │
  ├─ hermes-gateway   :8642  (127.0.0.1 only)
  │     ├─ Bedrock inference via IAM role (no API keys)
  │     ├─ Telegram long-poll (outbound HTTPS)
  │     └─ OpenAI-compatible API
  │
  ├─ hermes-dashboard :9119  (127.0.0.1 only)
  └─ hermes-workspace :3000  (127.0.0.1 only)
  │
  ├── EFS /mnt/efs/hermes  (RETAIN · encrypted · uid=10000 access point)
  │     .env · config.yaml · sessions · skills · SOUL.md · logs · state DBs
  │     ↑ persistent agent data — survives instance replacement
  │
  ├── EBS root volume
  │     /opt/hermes-agent      ← hermes venv (pip I/O stays off EFS)
  │     /opt/hermes-workspace  ← workspace UI
  │
  └── Secrets Manager (hermes/runtime)
        API_SERVER_KEY · TELEGRAM_BOT_TOKEN · TELEGRAM_ALLOWED_USERS

Three CDK stacks, deployed in order:

Stack	What it provisions
`HermesNetworkStack`	VPC (1 AZ), public subnet, IGW, S3 gateway endpoint, security groups
`HermesStorageStack`	EFS (RETAIN, encrypted, uid=10000 access point), Secrets Manager
`HermesComputeStack`	EC2 (m7g.medium), IAM (Bedrock-scoped), bootstrap user-data, systemd units

The Security Trick: Zero Inbound Rules

The instinct when deploying anything on AWS is to reach for a private subnet, a NAT Gateway, and VPC interface endpoints. That's the enterprise posture. It's also ~$88/mo in endpoint costs alone before your instance even starts.

For a personal deployment the actual security boundary is not the subnet type — it's what's listening on the instance.

All three services bind to 127.0.0.1 only, and the security group has zero inbound rules. Inbound connection attempts to the public IP are silently dropped at the security group, and even if one got through, nothing is listening on a non-loopback interface.

# network_stack.py — the entire inbound surface of the instance
self.instance_security_group = ec2.SecurityGroup(
    self,
    "InstanceSg",
    vpc=self.vpc,
    description="Hermes EC2 - zero inbound; egress via IGW. Admin via SSM.",
    allow_all_outbound=True,
)
# No add_ingress_rule calls. Ever.

Admin access is via AWS Systems Manager Session Manager — outbound HTTPS to the SSM service endpoint, no inbound port required. SSM also handles port-forwarding, which is how the workspace reaches your browser.

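Telegram uses long-polling. The gateway opens an outbound HTTPS request to Telegram's servers, and Telegram holds that request open until a message arrives or the poll times out; the gateway then polls again. Messages come back as responses to requests the instance initiated. Again: zero inbound.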
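
To make the long-polling flow concrete, here is a minimal sketch of a poll step against the Telegram Bot API's getUpdates method. This is not Hermes's gateway code: the function names and the injected `fetch` callable are hypothetical, chosen so the offset bookkeeping can be read on its own.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

API_BASE = "https://api.telegram.org"  # public Bot API host


def http_fetch(token: str, offset: int, timeout_s: int = 30) -> list[dict]:
    """One outbound long-poll request; Telegram holds it open until
    updates arrive or timeout_s elapses."""
    query = urllib.parse.urlencode({"offset": offset, "timeout": timeout_s})
    url = f"{API_BASE}/bot{token}/getUpdates?{query}"
    # Client-side timeout is slightly longer than the server-side hold.
    with urllib.request.urlopen(url, timeout=timeout_s + 10) as resp:
        return json.load(resp)["result"]


def poll_once(fetch: Callable[[int], list[dict]], offset: int) -> tuple[list[dict], int]:
    """Fetch pending updates and compute the offset that acknowledges them."""
    updates = fetch(offset)
    if updates:
        # Passing last update_id + 1 on the next call confirms receipt.
        offset = max(u["update_id"] for u in updates) + 1
    return updates, offset
```

In a real loop, `fetch` would be something like `lambda off: http_fetch(token, off)` called repeatedly. Every request originates from the instance, which is why no inbound rule is needed.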

The result: there is no attack surface on the public IP. Shodan can scan it all day.
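
The loopback claim is easy to check in isolation. The sketch below is illustrative only (the helper names are hypothetical and it is not how Hermes configures its services): it shows how a bind address can be classified, and what a listener bound to 127.0.0.1 looks like at the socket level.

```python
import ipaddress
import socket


def is_loopback_bind(host: str) -> bool:
    """True when a bind address is only reachable from the instance itself."""
    return ipaddress.ip_address(host).is_loopback


def bind_local(port: int = 0) -> socket.socket:
    """Open a listening TCP socket on 127.0.0.1 (port 0 lets the OS choose)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock
```

A service bound to 0.0.0.0 would fail the `is_loopback_bind` check, and that is the case the zero-inbound security group exists to backstop.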

The Memory Trick: EFS for Data, EBS for Code

Persistent agent data — SOUL.md, skills, session history, state DBs, the .env with all secrets, the config.yaml — lives on an EFS volume mounted at /mnt/efs/hermes. The hermes binary and venv live on the root EBS volume at /opt/hermes-agent instead.

Why split? EFS Elastic Throughput charges per GB accessed. Moving the venv to EBS removes that install/update path from EFS, keeping steady-state EFS I/O costs around ~$1/mo instead of paying for heavy throughput during dependency updates. See docs/STORAGE.md for the full reference.

The EFS has RemovalPolicy.RETAIN. The access point locks the path to UID 10000. Automatic backups are on with a 35-day window.

# storage_stack.py — the persistence layer
self.file_system = efs.FileSystem(
    self,
    "HermesEfs",
    vpc=vpc,
    encrypted=True,
    removal_policy=RemovalPolicy.RETAIN,       # survives cdk destroy
    lifecycle_policy=efs.LifecyclePolicy.AFTER_30_DAYS,
    throughput_mode=efs.ThroughputMode.ELASTIC,
    enable_automatic_backups=True,
)

self.access_point = self.file_system.add_access_point(
    "HermesAccessPointUid10000",
    path="/hermes",
    create_acl=efs.Acl(owner_uid="10000", owner_gid="10000", permissions="0750"),
    posix_user=efs.PosixUser(uid="10000", gid="10000"),
)

What this means in practice: if the EC2 instance develops a problem, you run cdk deploy and get a fresh one. The new instance mounts the same EFS, reads the same .env, reinstalls the venv to EBS via user-data, and all three systemd services start with the agent's full memory intact. No manual data migration, no re-configuration.

The EC2 root EBS is flagged delete_on_termination=True. Agent data is on EFS (RETAIN); install artifacts on EBS are recreated automatically on each deploy.

Bedrock: No API Keys, IAM Role Does the Work

Hermes connects to Bedrock as described in the Hermes Bedrock guide. The EC2 instance has an IAM role scoped to bedrock:InvokeModel, bedrock:Converse, and the streaming variants, on specific inference-profile and foundation-model ARNs only.

No API keys anywhere. No key rotation. If the instance is compromised, the blast radius is bounded to Bedrock inference on two specific models. The role cannot touch S3, DynamoDB, other accounts, or anything else.
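
As a rough illustration of what "scoped to specific ARNs" can look like, the sketch below builds a policy document in plain Python. It is an assumption-laden example, not the stack's actual policy: the account ID is a placeholder, the helper name is hypothetical, and only the two invoke actions are listed, so adjust the action list to match the real role.

```python
ACTIONS = ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"]


def bedrock_policy(account_id: str, regions: list[str], profile_ids: list[str]) -> dict:
    """Build a least-privilege policy document for the given inference profiles."""
    resources = []
    for profile_id in profile_ids:
        # Inference profiles are account-scoped resources in the calling region
        # (assumed here to be the first region in the list).
        resources.append(
            f"arn:aws:bedrock:{regions[0]}:{account_id}:inference-profile/{profile_id}"
        )
        # The underlying foundation model must be allowed in every region
        # the cross-region profile can route to.
        model_id = profile_id.split(".", 1)[1]  # drop the "us." geography prefix
        for region in regions:
            resources.append(f"arn:aws:bedrock:{region}::foundation-model/{model_id}")
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": ACTIONS, "Resource": resources}],
    }


policy = bedrock_policy(
    "123456789012",  # placeholder account ID
    ["us-east-1", "us-east-2", "us-west-2"],
    ["us.anthropic.claude-sonnet-4-6", "us.amazon.nova-lite-v1:0"],
)
```

Anything outside the listed resources is implicitly denied, which is what bounds the blast radius to the two models.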

Two models run in this stack:

Model	Role	Why
`us.anthropic.claude-sonnet-4-6`	Primary (all main agent tasks)	Best reasoning for the price on Bedrock
`us.amazon.nova-lite-v1:0`	Auxiliary (5 background slots)	Roughly 50–60× cheaper than Sonnet per token at the rates listed under Cost Breakdown; used for web extraction, vision, summarisation

The us. prefix is the cross-region inference profile — Bedrock routes to us-east-1, us-east-2, or us-west-2 automatically for throughput. You enable both models once in the Bedrock Model Access console and never touch it again.

Cost Breakdown

Infra (always-on, us-east-1)

Component	Detail	≈ Monthly
EC2 `m7g.medium` (Graviton, 1 vCPU / 4 GiB)	730 hrs × $0.0404/hr	~$29.50
EBS gp3 root (30 GiB, encrypted)	venv + workspace on EBS	$2.40
EFS Standard (~64 MB agent data)	$0.30/GiB-mo storage	~$0.02
EFS Elastic throughput I/O	venv/deps on EBS; steady-state session/state access only	~$1/mo
EFS automatic backups	~$0.05/GiB-mo	~$0.50
Secrets Manager	1 secret × $0.40	$0.40
CloudWatch Logs + metrics	ingestion + custom metrics	~$2
NAT Gateway / VPC endpoints	none	$0
Infra total (always-on)		≈ $35/mo

No NAT Gateway. No interface VPC endpoints. The EC2 routes outbound directly through the Internet Gateway. That single architectural decision — public subnet, zero-inbound SG instead of private subnet + NAT — is 58% cheaper than the equivalent private-subnet setup with six VPC endpoints.
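
The always-on total can be reproduced from the table's own line items. The short script below is only a sanity check of that arithmetic; the dictionary keys are labels I chose, and the figures are copied from the table rather than fetched from any pricing API.

```python
HOURS_PER_MONTH = 730  # the table's always-on assumption

LINE_ITEMS_USD = {
    "EC2 m7g.medium": HOURS_PER_MONTH * 0.0404,
    "EBS gp3 root": 2.40,
    "EFS storage": 0.02,
    "EFS elastic throughput": 1.00,
    "EFS backups": 0.50,
    "Secrets Manager": 0.40,
    "CloudWatch": 2.00,
}


def monthly_total(items: dict[str, float]) -> float:
    """Sum the line items and round to cents."""
    return round(sum(items.values()), 2)


if __name__ == "__main__":
    for name, cost in LINE_ITEMS_USD.items():
        print(f"{name:<24} ${cost:6.2f}")
    print(f"{'Total':<24} ${monthly_total(LINE_ITEMS_USD):6.2f}")
```

The sum lands just under $36, consistent with the ≈ $35/mo headline figure, and EC2 compute is by far the dominant term.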

Stop it when you're not using it

aws ec2 stop-instances --instance-ids <InstanceId> --region us-east-1

EC2 compute billing stops immediately, and most EFS data-access I/O should stop with the services. EFS storage, EBS, Secrets Manager, and CloudWatch keep billing at ~$8/mo. When you start it again, SSM is ready in ~60 seconds and all three hermes-* systemd units restart automatically. No re-bootstrapping, no re-configuration, agent memory fully intact.

Floor: ~$8/mo when off. ~$35/mo when always-on.

Bedrock tokens (variable, on top of infra)

Model	Rate	Typical personal use
Claude Sonnet 4.x	~$3/M in · $15/M out	$10–50/mo
Nova Lite (aux slots)	~$0.06/M in · $0.24/M out	< $2/mo
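
To turn the per-million-token rates above into a monthly estimate, a small calculator helps. The token volumes below are hypothetical, picked only to show the arithmetic; actual usage will vary.

```python
def token_cost_usd(input_tokens: int, output_tokens: int,
                   in_rate: float, out_rate: float) -> float:
    """Cost in USD, with rates expressed in USD per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


SONNET = {"in_rate": 3.00, "out_rate": 15.00}
NOVA_LITE = {"in_rate": 0.06, "out_rate": 0.24}

# Hypothetical month: 5M input tokens and 1M output tokens on each model.
sonnet_month = token_cost_usd(5_000_000, 1_000_000, **SONNET)
nova_month = token_cost_usd(5_000_000, 1_000_000, **NOVA_LITE)
```

At that volume the Sonnet estimate falls inside the $10–50 range in the table, and the same traffic on Nova Lite stays under a dollar.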

vs. the alternative

ChatGPT Plus is $20/mo. You get no persistent agent filesystem, no terminal backend, no Telegram long-polling, and far less control over where memory and logs live.

The Hermes setup is more infrastructure to own, but that is the point: you own the memory, the skills, the SOUL.md that shapes the agent's persona, the logs, and the conversation history. Stop the instance today, redeploy in six months, and the agent picks up from the same EFS-backed state.

The Setup, Briefly

1. Enable Bedrock model access (one-time in the console, two models)
2. cdk deploy --all: provisions all three stacks; first boot takes 5–8 min (package installs + workspace build)
3. Create a Telegram bot via @BotFather, get your user ID via @userinfobot
4. Add the bot token + your user ID to Secrets Manager (hermes/runtime), sync to EFS, restart gateway
5. Port-forward `:3000` via SSM to reach the web workspace from your laptop

# Access the workspace from your laptop
aws ssm start-session --target <InstanceId> \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["3000"],"localPortNumber":["3000"]}' \
  --region us-east-1

open http://localhost:3000

After step 4, Telegram just works. Message your bot, get a reply. No additional setup.
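
The "sync to EFS" part of step 4 could be sketched as follows. This is an assumption about how such a sync might work, not the repo's actual script: it assumes the secret is stored as a flat JSON object, the function names are hypothetical, and it overwrites the .env file wholesale, whereas a real sync would likely merge with existing keys and quote values.

```python
import json


def render_env(secret_string: str) -> str:
    """Turn a JSON secret payload into KEY=value lines for a .env file."""
    values = json.loads(secret_string)
    return "".join(f"{key}={values[key]}\n" for key in sorted(values))


def sync_secret_to_env(secret_id: str = "hermes/runtime",
                       env_path: str = "/mnt/efs/hermes/.env") -> None:
    """Fetch the secret using the instance role and write it to the EFS-backed .env."""
    import boto3  # imported lazily so render_env works without AWS access

    payload = boto3.client("secretsmanager").get_secret_value(SecretId=secret_id)
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write(render_env(payload["SecretString"]))
```

After a sync like this, restarting the gateway unit picks up the new TELEGRAM_BOT_TOKEN and TELEGRAM_ALLOWED_USERS values.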

What Surprised Me

I started with a private subnet, a NAT Gateway, and VPC interface endpoints for SSM, Bedrock, Secrets Manager, EFS, and CloudWatch. It's what every AWS security guide recommends. It's also ~$88/mo in endpoint costs before a single token is processed.

The insight that unlocked this architecture: the security boundary for a personal agent isn't the subnet — it's what's reachable on the instance. With zero inbound SG rules and all services bound to loopback, the public IP is inert. SSM and Telegram's long-polling handle the two access patterns (admin shell / bot messages) over outbound HTTPS. No VPN, no bastion, no open ports.

The most secure design for this use case turned out to be the simplest one.

Built with Hermes Agent · AWS CDK (Python) · Amazon Bedrock · SSM Session Manager

A Production-Shaped Multi-Agent SRE System on Amazon Bedrock AgentCore

Nitesh Reddy Challa — Fri, 08 May 2026 14:03:23 +0000

At 2 AM, your on-call engineer has four browser tabs open: CloudWatch Logs, CloudWatch Metrics, a runbook wiki, and Slack. They are synthesizing evidence by hand, and every minute spent switching between tabs adds to MTTR. Building an AI agent to close that gap sounds simple until you realize you are actually wiring a runtime, a JWT-gated API layer, an MCP transport, memory persistence, guardrails, observability, and an evaluation harness. This post walks through a production-shaped template that does that wiring once, so you swap four files and ship your own domain.

Running the full stack for the 7-day demo cost $2.11 USD.

What this article is: A teardown of a fork-and-ship CDK template for multi-agent systems on Bedrock AgentCore. The built-in exemplar is an SRE incident-response system running against seeded demo fixtures in CloudWatch — not real production data. That's intentional: synthetic fixtures prove the pattern works end-to-end so you can swap in your own data sources with confidence.

To adapt it to your domain: 4 file swaps — MCP server, sub-agent, orchestrator prompt, fixtures. Everything else (Runtime, Gateway, Memory, Guardrails, OTEL, eval harness) doesn't move. Jump to Adapting to Your Domain if you want that first.

The Problem: Manual Incident Response Does Not Scale

When an incident fires, three things break down simultaneously:

Responders gather evidence from disconnected windows (logs, metrics, runbooks)
Operational knowledge lives in heads and wikis, not in the workflow
Synthesis happens manually under pressure — inconsistent and slow

The fix is a single orchestration path: specialized agents gather evidence in parallel, synthesize once, and return a structured answer. That is what this template implements.

Architecture: Strands Agents-as-Tools on AgentCore

Important distinction: This project uses Strands' agents-as-tools pattern — four sub-agents as in-process @tool functions inside a single container. This is architecturally different from Amazon Bedrock Agents' managed multi-agent collaboration feature (separate Agent resources wired via AssociateAgentCollaborator). The trade-off is intentional: agents-as-tools means zero inter-agent network hops, the same call stack, and identical local/deployed behavior. The managed Bedrock Agents approach earns its complexity when you need cross-team ownership or independent release cycles.

User → Cognito JWT → AgentCore Gateway → AgentCore Runtime (ARM64)
                                                │
                                   Orchestrator (any LLM via Strands)
                          ┌──────────────┬──────────────┬──────────────┐
                     log_analyst  metrics_analyst  runbook_agent  security_auditor
                          │              │              │
                       CW MCP         CW MCP       Lambda MCP
                          └──────────────┴──────────────┘
                                         │
                              CloudWatch Logs + Metrics + DynamoDB
                                         │
                          OTEL → CloudWatch Gen AI Observability

The orchestrator receives the four sub-agents through its tools=[...] argument. The LLM selects which to call based on their docstrings, with no hardcoded dispatch logic:

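from pathlib import Path

from strands import Agent

# memory_enabled, build_session_manager, strands_bedrock_model, and the four
# sub-agent tools are defined elsewhere in the repo (imports not shown).

def build_orchestrator(*, session_id: str | None = None, actor_id: str | None = None) -> Agent: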
    """Strands orchestrator — four sub-agents exposed as @tool functions."""
    system_prompt = (Path(__file__).parent / "prompts" / "orchestrator.md").read_text(encoding="utf-8")
    agent_kwargs: dict[str, object] = {}
    if memory_enabled():
        agent_kwargs["session_manager"] = build_session_manager(
            session_id=session_id,
            actor_id=actor_id,
        )
    return Agent(
        model=strands_bedrock_model(),          # swappable — one env var
        system_prompt=system_prompt,
        tools=[log_analyst, metrics_analyst, runbook_agent, security_auditor_agent],
        **agent_kwargs,
    )
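
The docstring-driven routing idea can be illustrated without any SDK. The toy registry below is not Strands' implementation and the decorator is a hypothetical stand-in for its @tool: it only shows how a function's name and docstring can become the description a model reads when choosing a tool.

```python
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}


def tool(fn: Callable[[str], str]) -> Callable[[str], str]:
    """Register a function as a tool; its docstring becomes the description."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def log_analyst(query: str) -> str:
    """Search application logs for errors related to the query."""
    return f"log findings for: {query}"


@tool
def metrics_analyst(query: str) -> str:
    """Inspect latency and error-rate metrics related to the query."""
    return f"metric findings for: {query}"


def tool_catalog() -> str:
    """The text a model would see when deciding which tool to call."""
    return "\n".join(f"{name}: {fn.__doc__}" for name, fn in TOOLS.items())
```

Because the sub-agents are ordinary in-process functions, calling one is a plain function call on the same stack, which is where the "zero inter-agent network hops" property comes from.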

Adapting to Your Domain: Four File Swaps

Everything outside these four paths is domain-agnostic scaffolding — do not touch it:

Swap	From	To
Custom MCP server	`mcp_custom/runbook_server/`	`mcp_custom/<your_domain>_server/`
Sub-agent	`agent/sub_agents/runbook.py`	`agent/sub_agents/<your_domain>.py`
Orchestrator prompt	`agent/prompts/orchestrator.md`	Add one tool entry + one routing rule (additive only)
Fixtures + eval cases	`fixtures/scenarios/` + `eval/test_cases.jsonl`	Your 3 canonical queries

After the four swaps: make test && make lint → make phase1-demo-debug → DOCKER_BUILDKIT=0 make phase4-deploy.

The scaffolding — Runtime, Gateway, Memory, Guardrails, OTEL, eval harness — does not move. See docs/ADAPT.md for the step-by-step checklist and a worked Jira triage example.

Session & Memory Model

AgentCore provides two distinct persistence layers — keeping these separate is important:

Layer	Scope	What it stores	Lifetime
Runtime session (microVM)	Single session (one or more invocations)	In-flight context, tool outputs, reasoning trace	15-min idle / 8-hr max
AgentCore Memory	Cross-session	Conversation history (session-window, sliding-window, or long-term summarization)	Configurable TTL

Each session runs in a dedicated microVM with isolated CPU, memory, and filesystem. When the session ends, the microVM is terminated and its memory is sanitized, so there is no cross-session data contamination, even with non-deterministic AI processes. AgentCore Memory is opt-in (AGENTCORE_MEMORY_ENABLED=true); the session ID propagates through every OTEL span automatically.
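
The two runtime-session limits in the table (15 minutes idle, 8 hours maximum) combine as an either-or condition. The helper below is a hypothetical model of that rule for reasoning about session reuse; it is not an AgentCore API.

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(minutes=15)
MAX_LIFETIME = timedelta(hours=8)


def session_expired(created: datetime, last_active: datetime, now: datetime) -> bool:
    """True once either the idle timeout or the maximum lifetime has passed."""
    return (now - last_active) >= IDLE_TIMEOUT or (now - created) >= MAX_LIFETIME
```

Anything that must outlive those limits belongs in AgentCore Memory rather than in the runtime session.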

MCP as Transport and Policy Layer

log_analyst and metrics_analyst are both backed by the same CloudWatch MCP server. Specialization happens through per-agent tool filters: one server, two different tool surfaces, zero duplication:

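from mcp import StdioServerParameters, stdio_client
from strands.tools.mcp import MCPClient

# ToolFilters and _mcp_subprocess_env are defined elsewhere (imports not shown).

def cloudwatch_mcp_client(*, tool_filters: ToolFilters) -> MCPClient: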
    """Same MCP server, different tool surface per sub-agent."""
    return MCPClient(
        lambda: stdio_client(
            StdioServerParameters(
                command="uvx",
                args=["awslabs.cloudwatch-mcp-server@latest"],
                env=_mcp_subprocess_env(),
            )
        ),
        startup_timeout=120,
        tool_filters=tool_filters,  # ← the only difference between sub-agents
    )
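
Conceptually, a tool filter is an allow-list applied to the server's full tool catalog. The sketch below is a stand-alone illustration with a hypothetical `filter_tools` helper and made-up tool names; it does not reproduce the shape of Strands' `tool_filters` argument or the CloudWatch MCP server's real tool list.

```python
import re


def filter_tools(tool_names: list[str], allowed: list[str]) -> list[str]:
    """Keep only tools whose names fully match one of the allowed regex patterns."""
    patterns = [re.compile(p) for p in allowed]
    return [name for name in tool_names if any(p.fullmatch(name) for p in patterns)]


# Hypothetical tool names exposed by a single server.
SERVER_TOOLS = ["query_logs", "list_log_groups", "get_metric_data", "list_alarms"]

# Two sub-agents, two different views of the same catalog.
log_tools = filter_tools(SERVER_TOOLS, [r".*log.*"])
metric_tools = filter_tools(SERVER_TOOLS, [r".*metric.*", r".*alarm.*"])
```

Each sub-agent sees only the subset it needs, which keeps its prompt context small and limits what it can call.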

The runbook server uses a dual-shape design — local stdio in Phase 1, Gateway-registered Lambda target in Phase 2+. The sub-agent code does not change between modes; only the transport env var changes.
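
A transport switch of this kind usually comes down to one validated environment lookup. The sketch below assumes the variable is AGENT_TRANSPORT_MODE (named later in this post) and that the local mode is called "stdio"; that value name and the helper are assumptions, not taken from the repo.

```python
import os
from typing import Mapping

VALID_MODES = {"stdio", "gateway"}  # "stdio" is an assumed name for the local mode


def transport_mode(env: Mapping[str, str] = os.environ) -> str:
    """Read AGENT_TRANSPORT_MODE, defaulting to local stdio when unset."""
    mode = env.get("AGENT_TRANSPORT_MODE", "stdio").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported AGENT_TRANSPORT_MODE: {mode!r}")
    return mode
```

Failing loudly on an unknown value matters here: a typo should not silently fall back to the local transport in a deployed container.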

Why Not Step Functions at the Core?

AWS Prescriptive Guidance is explicit: Step Functions handles deterministic, rule-based flows. AgentCore handles AI-native orchestration where the LLM is the workflow engine. Mixing them at the reasoning layer adds latency without benefit.

In this template, Step Functions belongs at the edges — nightly eval harness, human-in-the-loop approval flows, infra lifecycle — not between the orchestrator and sub-agents.

Pattern	Right fit
Single agent, all tools	Simplest — context pressure grows as tools scale
Agents-as-tools (this repo)	Single team, one container, LLM routes, local debuggable
A2A choreography	Cross-team ownership, independent release cycles
Step Functions + agents	Deterministic outer workflow, AI inner reasoning

Enterprise Security: Three-Layer Least-Privilege Boundary

client.create_gateway(
    name=gateway_name,
    protocolType="MCP",
    roleArn=config["role_arn"],
    authorizerType="CUSTOM_JWT",
    authorizerConfiguration={
        "customJWTAuthorizer": {
            "discoveryUrl": _issuer_url(region, config["user_pool_id"]),
            "allowedClients": [config["client_id"]],
        }
    },
)

Three explicit boundaries, each independently enforced:

Layer	Mechanism	What it prevents
Identity	Cognito JWT — `discoveryUrl` + `allowedClients` validated on every request	Unauthenticated callers
Authorization	Gateway IAM service role (`roleArn`) scoped to registered targets only	Lateral movement to unregistered services
Transport enforcement	`AGENT_TRANSPORT_MODE=gateway` in the runtime container	Local stdio bypass in production

Bedrock Guardrails are wired separately at the model layer (agent/guardrails.py) and provisioned via CDK (infrastructure/stacks/guardrail_stack.py) — covering input/output filtering independent of the transport layer.
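
The create_gateway call earlier in this section passes `_issuer_url(...)` as the discoveryUrl, but that helper's body is not shown. Assuming it follows the standard Cognito user pool OIDC layout, an equivalent helper might look like this (the function name is hypothetical):

```python
def cognito_discovery_url(region: str, user_pool_id: str) -> str:
    """OIDC discovery document URL for a Cognito user pool."""
    issuer = f"https://cognito-idp.{region}.amazonaws.com/{user_pool_id}"
    return f"{issuer}/.well-known/openid-configuration"
```

The gateway fetches signing keys via that document, so a wrong region or pool ID shows up as every token being rejected.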

Honest Eval: What the Scores Actually Mean

The AgentCore LLM-as-judge eval runs three scenarios against the deployed runtime:

Scenario	Status	GoalSuccessRate	Helpfulness
`debug_external_dep_01`	COMPLETED	0.0	0.83 — Very Helpful
`debug_external_dep_02`	COMPLETED	0.0	0.67 — Moderately Helpful
`debug_external_dep_03`	COMPLETED	0.0	0.67 — Moderately Helpful
Error count	—	0	—

GoalSuccessRate 0.0 is a fixture alignment gap, not a system failure. The evaluator matches exact strings ("Stripe", "503", "CircuitBreakerOpen") against agent responses. The agent reasons in natural language ("payment provider", "upstream errors"): the semantics match, but the strings don't. Updating expected_markers in eval/test_cases.jsonl to match the agent's vocabulary fixes this without touching the agent.
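
The mismatch is easy to reproduce with a toy scorer. The function below is a hypothetical stand-in for exact-marker matching, not the AgentCore evaluator, and both sample responses are invented for illustration.

```python
def goal_success(response: str, expected_markers: list[str]) -> float:
    """1.0 only when every marker appears verbatim in the response."""
    return 1.0 if all(marker in response for marker in expected_markers) else 0.0


MARKERS = ["Stripe", "503", "CircuitBreakerOpen"]

# Semantically correct, but phrased in the agent's own vocabulary.
natural = "The payment provider is returning upstream errors, so the circuit breaker opened."
# Same diagnosis using the fixture's literal strings.
literal = "Stripe returned 503s and CircuitBreakerOpen fired on the checkout service."

natural_score = goal_success(natural, MARKERS)
literal_score = goal_success(literal, MARKERS)
```

The first response scores zero despite being a correct diagnosis, which is the gap the expected_markers update is meant to close.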

Helpfulness 0.83 is the meaningful signal — the LLM judge rated the response as something that would actually help an SRE. The runbook was matched, the mitigation steps were numbered and actionable, and the analysis was coherent.

Surfacing this gap explicitly rather than hiding it is the point: partial confidence is a design principle here, not an error state. When evidence is unavailable, the system returns [Partial] — data not retrieved instead of fabricating an answer.

Observed Cost: 7-Day Demo Window

Layer	Approx. cost	Pricing model
AgentCore Runtime	Majority of total	Consumption-based — billed on active CPU only, not LLM wait time
AgentCore Gateway	Small	Per-request
AgentCore Memory	Small	Storage + retrieval ops
Bedrock Guardrails	Small	Per text-unit processed
Cognito (Auth)	Negligible	MAU-based
Total (7 days)	$2.11 USD	Full stack including all layers

The consumption-based Runtime pricing is the key lever: you are not billed for CPU while the container idles waiting on model responses. For SRE use cases where invocations are event-driven (not continuous), the economics are favorable.

Why Strands Agents Over LangChain or CrewAI?

Strands Agents is an open-source SDK published by AWS with first-class AgentCore Runtime integration:

OTEL built-in via ADOT auto-instrumentation — no middleware to configure, spans appear in CloudWatch Gen AI Observability automatically
Typed @tool contracts — sub-agents are plain Python functions; their docstrings become tool descriptions the LLM uses for routing
MCP tool filtering via a single tool_filters= kwarg — one server, scoped tool surface per sub-agent
Model-agnostic — swap the model ID in one place (strands_bedrock_model()); Claude, Nova, and others all work

LangChain and CrewAI are valid choices for different constraint sets. Strands fits here because the target is AgentCore Runtime, not a generic cloud environment.

Closing

The hard part of building agentic systems on AWS is not writing the agent logic; it is wiring runtime, auth, MCP, memory, guardrails, observability, and eval into a coherent system you can actually ship and trust. Every one of those layers is already wired here: per-session microVM isolation, Cognito JWT gating, OTEL to CloudWatch Gen AI Observability, LLM-as-judge evaluation via AgentCore's on-demand eval API, and CDK IaC for all infrastructure.

Fork it. Swap mcp_custom/runbook_server/ for your domain's data source. Update the orchestrator prompt. Ship. The other eleven services do not move.

Repo: agentcore-multiagent-framework · Adapt guide: docs/ADAPT.md · Run it: follow the First-time deployed demo (recommended path) section in README.md (CDK deploy → token/runtime deploy → seed → demo queries) · Local-only fallback: make phase1-demo-debug