DEV Community: Amit Kayal

A Scaling Lesson Building Production-Grade Agentic AI Systems

Amit Kayal — Tue, 19 May 2026 18:30:50 +0000

One of the early observations we had while designing enterprise AI agents was this:

Giving an agent more tools does not necessarily make it smarter.

In theory, the opposite sounded correct.

If an agent had access to customer systems, payment systems, inventory, shipping, reporting, ticketing, email, scheduling, analytics, and internal knowledge bases — it should become more powerful and autonomous.

But what we observed in real implementations was very different.

The more tools we added, the more unstable the system became.

Not because the model was weak.

Not because the tools were poorly built.

But because the agent’s decision space became too large.

For every user request, the agent had to evaluate all available tools, compare descriptions, infer intent, decide sequencing, and determine the best execution path.

Now imagine doing this with 18 tools:

Customer lookup
Order search
Refund processing
Inventory checking
Shipping tracking
Email sending
Ticket creation
Knowledge base search
Sentiment analysis
Language translation
Calendar scheduling
Report generation
Data export
User authentication
Payment processing
Discount application
Feedback collection
Escalation routing

Initially, everything looked manageable.

But as workflows became more dynamic, we started observing:

wrong tool selection,
unnecessary tool chaining,
higher latency,
increased token usage,
inconsistent execution paths,
and occasional hallucinated actions.

The problem was not intelligence.

The problem was cognitive overload inside the orchestration layer.

Over time, one pattern became very clear:

Agents perform significantly better when their responsibility boundaries are smaller.

In our experience, once an agent moves beyond roughly 4–5 actively usable tools, reliability starts dropping rapidly. Enterprise orchestration guidance is moving in the same direction, recommending smaller, specialized agents instead of monolithic “super agents.”

That observation changed how we started designing AI systems.

Instead of building one massive “do everything” agent, we moved toward specialized agents with tightly scoped responsibilities.

For example:

A support agent handles:

customer lookup,
ticket creation,
escalation routing,
knowledge retrieval.

A commerce agent handles:

orders,
refunds,
discounts,
payments.

An operations agent handles:

shipping,
inventory,
reporting,
exports.

This immediately improved:

tool accuracy,
execution consistency,
observability,
debugging,
latency,
and operational trust.
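The split above can be written down as a plain mapping from agent to tools and checked mechanically. This is only a sketch: the tool identifiers are hypothetical labels derived from the lists above, and the limit of 5 reflects the rough threshold mentioned earlier, not a hard rule.

```python
# Hypothetical tool identifiers, grouped by the three agents described above.
AGENT_TOOLS = {
    "support": ["customer_lookup", "ticket_creation", "escalation_routing", "knowledge_search"],
    "commerce": ["order_search", "refund_processing", "discount_application", "payment_processing"],
    "operations": ["shipping_tracking", "inventory_checking", "report_generation", "data_export"],
}

MAX_TOOLS_PER_AGENT = 5  # assumption: the rough 4-5 tool threshold from the text


def find_scope_problems(agent_tools, max_tools=MAX_TOOLS_PER_AGENT):
    """Return human-readable problems: oversized agents and tools owned twice."""
    problems = []
    owner_of = {}
    for agent, tools in agent_tools.items():
        if len(tools) > max_tools:
            problems.append(f"{agent} has {len(tools)} tools (limit {max_tools})")
        for tool in tools:
            if tool in owner_of:
                problems.append(f"{tool} is shared by {owner_of[tool]} and {agent}")
            else:
                owner_of[tool] = agent
    return problems
```

A check like this could run in CI so that a boundary violation shows up as a failed build rather than as unstable agent behavior.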

But another important learning came later.

Even after distributing tools properly, systems still degraded when too many agents were active simultaneously.

This is something many teams underestimate.

As the number of agents increases, coordination overhead also increases:

more inter-agent communication,
more memory synchronization,
more orchestration reasoning,
more retries,
more conflict resolution,
and more state tracking.

At lower scale, this is manageable.

At enterprise scale, it becomes a serious engineering challenge.

We observed cases where:

agents started waiting on each other,
orchestration layers became bottlenecks,
duplicate reasoning increased token burn,
cascading retries created operational instability,
and observability became extremely difficult.

Multi-agent systems introduce their own scaling complexity around coordination, governance, and orchestration overhead. Most production-grade architecture guidance today recommends keeping orchestration layers as simple as possible.

Over time, we established a few practical rules of thumb internally.

Some Practical Rules of Thumb We Follow Now

1. Keep Tool Count Small Per Agent

Our practical guideline today is:

3–5 tools → ideal
6–8 tools → manageable with careful prompting
10+ tools → requires routing/filtering layers
15+ tools → usually an architectural warning sign

The issue is not model capability.

It is decision dilution.
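The tiers above are easy to encode as a review-time check. The sketch below makes two assumptions the list does not state: a count of 9 is grouped with the 6–8 tier, and fewer than 3 tools is treated like the 3–5 tier.

```python
def tool_count_guidance(tool_count: int) -> str:
    """Map a per-agent tool count to the guideline tiers listed above."""
    if tool_count < 0:
        raise ValueError("tool_count must be non-negative")
    if tool_count >= 15:
        return "architectural warning sign"
    if tool_count >= 10:
        return "requires routing/filtering layers"
    if tool_count >= 6:
        # Assumption: 9 tools is not covered by the list, so it is grouped here.
        return "manageable with careful prompting"
    # Assumption: fewer than 3 tools is treated the same as the 3-5 tier.
    return "ideal"
```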

2. Every Agent Must Have One Clear Business Responsibility

We avoid mixing unrelated domains in a single agent.

Combinations we avoid include:

payments + support,
analytics + execution,
reporting + approvals,
inventory + customer engagement.

The narrower the responsibility boundary, the more predictable the behavior.

3. Start With the Lowest Complexity Possible

One important learning from enterprise orchestration patterns is this:

Do not introduce multi-agent architecture unless the workflow genuinely requires it.

Sometimes:

a prompt is enough,
a single agent is enough,
or the workflow is better handled through deterministic orchestration.

Not every problem needs “AI teamwork.”

4. Avoid Excessive Agent-to-Agent Conversations

Agent collaboration sounds powerful in demos.

But in production:

every interaction increases latency,
every message consumes tokens,
every dependency creates failure paths.

We now aggressively reduce unnecessary conversations between agents.

5. Retrieval Before Reasoning

Instead of exposing all tools to all agents, we first narrow candidates through:

semantic routing,
metadata filtering,
RAG-based retrieval,
workflow classification.

This significantly improves tool selection accuracy and reduces reasoning load.
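A minimal sketch of this narrowing step, assuming a metadata filter followed by a ranking pass. Plain word overlap stands in for semantic routing here; a real system would more likely use embeddings. The `Tool` type, the tool names, and the descriptions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    domain: str        # metadata used for filtering
    description: str   # text used for scoring


def shortlist_tools(request: str, domain: str, tools: list[Tool], top_k: int = 3) -> list[str]:
    """Filter tools by domain, then rank by word overlap with the request."""
    request_words = set(request.lower().split())
    scored = []
    for tool in tools:
        if tool.domain != domain:
            continue  # metadata filtering step
        overlap = len(request_words & set(tool.description.lower().split()))
        if overlap > 0:
            scored.append((overlap, tool.name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


TOOLS = [
    Tool("refund_processing", "commerce", "issue a refund for an order"),
    Tool("order_search", "commerce", "search for an order by id or customer"),
    Tool("shipping_tracking", "operations", "track a shipping status for an order"),
]
```

Only the shortlisted names would then be exposed to the agent for that request, so the model reasons over two or three candidates instead of the full catalog.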

6. Observability Is Mandatory

Once systems become multi-agent, debugging becomes one of the hardest engineering problems.

We now treat the following as first-class requirements:

distributed tracing,
token tracking,
step-level logging,
execution replay,
agent health monitoring,
retry visibility,
and orchestration graphs.

Without observability, production support becomes nearly impossible.

7. Human Escalation Is Still Critical

One thing we intentionally avoid is trying to automate every decision.

We now introduce human checkpoints for:

financial operations,
policy-sensitive actions,
low-confidence reasoning,
and customer-impacting workflows.

Autonomy without governance becomes operational risk.
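The checkpoint rule can be expressed as a small gate in front of tool execution. This is a sketch under stated assumptions: the category labels mirror the list above, and the 0.8 confidence threshold is a hypothetical value that would need tuning per workflow.

```python
# Category labels taken from the checkpoint list above.
SENSITIVE_CATEGORIES = {"financial", "policy_sensitive", "customer_impacting"}
CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off for "low-confidence reasoning"


def requires_human_approval(category: str, confidence: float) -> bool:
    """Return True when an action should pause for a human checkpoint."""
    return category in SENSITIVE_CATEGORIES or confidence < CONFIDENCE_THRESHOLD
```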

What I increasingly believe is that the future of enterprise AI is not one giant super-agent.

It is orchestrated systems of smaller specialized agents collaborating through routing, delegation, memory sharing, and controlled execution.

The real engineering challenge is no longer:
“How many tools can an agent use?”

The better question is:
“How effectively can we reduce the decision burden for each agent while keeping orchestration manageable?”

That has become one of the most important scaling lessons for us while building production-grade agentic AI systems.

How We Are Thinking About This in Cloud Architecture

One important realization for us was that multi-agent systems should not be treated as a single application deployment.

They should be treated as distributed cloud-native systems.

That changes the architecture significantly.

Today, the architecture pattern we increasingly follow looks something like this:

Specialized Agents as Independent Services

Each agent runs independently with:

isolated APIs,
dedicated scaling,
separate observability,
isolated memory/context,
and domain-level permissions.

This reduces blast radius and improves operational governance.

In AWS, this naturally aligns very well with:

Lambda,
ECS/EKS,
event-driven services,
queues,
Bedrock,
and serverless orchestration patterns.

What I personally liked while evaluating newer AWS patterns is how Amazon Bedrock AgentCore is trying to standardize several production concerns around agents. Instead of teams writing custom orchestration glue repeatedly, AgentCore is introducing managed capabilities around:

runtime isolation,
observability,
memory,
identity,
tool gateways,
and orchestration patterns.

One thing I strongly relate to from practical experience is this:

Building the reasoning layer is usually not the hardest part anymore.

The harder part is:

orchestration,
debugging,
tracing,
retries,
governance,
and operational scalability.

That is where systems usually become unstable at scale.

AWS AgentCore Observability is also moving in an interesting direction by treating agent execution visibility as a first-class production capability with:

execution tracing,
token monitoring,
latency tracking,
tool usage visibility,
and CloudWatch integration.

When you have multiple agents collaborating dynamically, you need visibility into:

why a tool was selected,
which agent delegated the task,
what context was shared,
where retries happened,
and why execution paths changed.

Without this, production debugging becomes extremely difficult.
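One way to capture those questions is a structured record per execution step, emitted as a single JSON log line. The field names below are hypothetical and are not the AgentCore or CloudWatch schema; they simply mirror the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class StepTrace:
    trace_id: str                 # ties all steps of one request together
    agent: str                    # agent that executed the step
    delegated_by: Optional[str]   # which agent delegated the task, if any
    tool: str                     # tool that was selected
    selection_reason: str         # why the tool was selected
    shared_context_keys: list[str] = field(default_factory=list)  # what context was shared
    retries: int = 0              # where retries happened


def to_log_line(step: StepTrace) -> str:
    """Serialize one step as a single JSON line for log aggregation."""
    return json.dumps(asdict(step), sort_keys=True)
```

Logging only the keys of shared context, rather than the values, is a deliberate choice in this sketch to keep customer data out of traces.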

Another pattern we increasingly prefer is asynchronous orchestration.

Instead of tightly coupling agents synchronously, we now lean more toward:

queues,
events,
workflow engines,
and loosely coupled communication.

It improves:

resilience,
scalability,
retry handling,
and fault isolation.

Most importantly, it prevents one overloaded agent from slowing down the entire system.
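The decoupling idea can be illustrated in-process with `asyncio.Queue`. This is a stand-in only: in a cloud deployment the queue would be a managed service, and the function names here are hypothetical. The bounded queue shows how a slow consumer applies backpressure instead of blocking the producer's own logic.

```python
import asyncio


async def producer(queue: asyncio.Queue, tasks: list[str]) -> None:
    """Upstream agent: hands off work without calling the downstream agent directly."""
    for task in tasks:
        await queue.put(task)   # waits only when the queue is full
    await queue.put(None)       # sentinel: no more work


async def worker(queue: asyncio.Queue, results: list[str]) -> None:
    """Downstream agent: pulls work at its own pace."""
    while True:
        task = await queue.get()
        if task is None:
            break
        await asyncio.sleep(0)  # stand-in for real agent work
        results.append(f"done:{task}")


async def run_pipeline(tasks: list[str]) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounded queue applies backpressure
    results: list[str] = []
    await asyncio.gather(producer(queue, tasks), worker(queue, results))
    return results
```

Calling `asyncio.run(run_pipeline(["a", "b", "c"]))` processes the tasks in order while neither side holds a direct reference to the other.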

Technical debt handling

Amit Kayal — Mon, 11 May 2026 13:15:12 +0000

Over the years, my opinion on technical debt has changed a lot. Earlier, I used to think technical debt meant bad engineering decisions.

Now I think differently.

In product companies, especially fast-moving SaaS and AI products, some level of technical debt is unavoidable. If teams try to make everything perfect from day one, they usually move too slowly.
The real problem is not technical debt.
The real problem is when nobody knows:

why the shortcut was taken
how long it can survive
what impact it will create later

Personally, I look at technical debt in 3 broad categories:

Strategic debt: Shortcuts taken consciously to move faster, validate ideas, or release quickly.
Operational debt: Things that slowly start hurting deployments, production stability, debugging, support effort, and developer productivity.
Architectural debt: This is the one that becomes dangerous over time. Scaling becomes harder, integrations become messy, releases become slower, and every new feature starts feeling more expensive to build.

I feel AI products make this even more complicated. In normal SaaS systems, debt usually impacts engineering speed. But in AI systems, technical debt can directly affect:

response quality
hallucination handling
latency
observability
model cost
evaluation consistency

And because AI systems are probabilistic, debugging becomes much harder compared to traditional software.

I’ve also seen SaaS platforms suffer heavily from invisible debt because of:

multi-tenant complexity
customer-specific customizations
integrations
deployment dependencies
security and compliance requirements

One weak architectural decision early on can create pain for years.

That’s why I personally prefer making technical debt visible and measurable instead of treating it as a future problem.

Some of the signals I usually watch:

deployment friction
rollback frequency
incident trends
onboarding difficulty for new engineers
release confidence
overall engineering velocity

One pattern I’ve noticed repeatedly:
When team size keeps increasing but delivery speed keeps dropping, technical debt is already affecting the organization.

Learnings while working with long-running AI agents

Amit Kayal — Mon, 11 May 2026 13:12:53 +0000

One of my biggest learnings while working with long-running AI agents is that logging and progress reporting are not optional features when the agent is tightly coupled with a UI — they are part of the product experience itself.

Initially, I used to think of logging mainly from a debugging or engineering perspective. But with agentic systems, especially long-running workflows involving multiple tools, reasoning steps, APIs, retries, or multi-agent coordination, I realized users experience “silence” very differently than traditional applications.
When an agent takes 30 seconds, 2 minutes, or longer without visible progress, users immediately start questioning:

Is the system stuck?
Did my request fail?
Is it doing the wrong thing?
Should I refresh or retry?

That uncertainty destroys trust very quickly.
I learned that users do not just want the final answer — they want confidence that the system is actively working toward the answer. Progress visibility creates psychological assurance. Even simple updates like:

“Analyzing uploaded documents…”
“Fetching data from CRM…”
“Generating recommendations…”
“Validating final response…”

dramatically improve user confidence and patience.
Another major realization was that long-running agents are fundamentally non-deterministic systems. Unlike traditional APIs, agents can:

take different execution paths,
loop through reasoning,
invoke tools dynamically,
retry failed steps,
or spend time resolving ambiguity.

Without structured logging and traceability, debugging becomes extremely difficult because the same input may not always produce the same internal execution path. Modern AI observability practices emphasize tracing tool calls, reasoning paths, latency, token usage, and execution flow because agent behavior is inherently complex and probabilistic.

I also learned that progress reporting is not only for users — it becomes equally important for engineering and operational visibility. Once agents move into production, observability helps teams identify:

where workflows slow down,
which tool calls fail,
why latency spikes happen,
and where hallucinations or execution deviations originate.

One practical lesson I learned is that UI-integrated agents should expose execution state intentionally, not dump raw logs. There is a difference between:

engineering telemetry,
operational traces,
and user-friendly progress communication.

Users need understandable milestones, while engineers need deep execution traces.
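A minimal sketch of that separation: each step carries a user-facing message and an internal name, and only the message is yielded to the UI, while details go to a logger. The step structure and logger name are hypothetical.

```python
import logging
from typing import Callable, Iterator

logger = logging.getLogger("agent.telemetry")

# (internal step name, user-facing message, work to perform)
Step = tuple[str, str, Callable[[], None]]


def run_with_progress(steps: list[Step]) -> Iterator[str]:
    """Yield a user-facing milestone before each step; log details separately."""
    for index, (internal_name, user_message, work) in enumerate(steps, start=1):
        yield user_message  # what the UI shows
        logger.info("step=%d name=%s status=started", index, internal_name)
        work()
        logger.info("step=%d name=%s status=finished", index, internal_name)
```

A UI layer would iterate over the generator and push each message to the client, so the user sees a milestone before the corresponding work begins.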
Another important learning was around perceived performance. In many cases, improving progress visibility improved user satisfaction more than reducing actual latency. A 90-second process with clear step-by-step reporting often feels faster and more reliable than a silent 40-second execution.

Today, I strongly believe that for long-running AI agents:

logging is part of reliability,
progress reporting is part of UX,
and observability is part of trust.

Building a Hybrid AWS Microservices Platform with API Gateway, Lambda, ECS, and Load Balancers

Amit Kayal — Mon, 20 Apr 2026 18:41:39 +0000

Introduction

When teams start splitting a large backend into smaller services, the first infrastructure question is usually not "How do we build a microservice?" but "How do we expose many different services safely, consistently, and without creating a networking mess?"

Our architecture provides a practical answer to that problem using a hybrid AWS design:

API Gateway as the front door
Lambda for lightweight serverless capabilities and supporting workflows
ECS Fargate for containerized business services
Internal load balancers for private service routing
Terraform for repeatable, staged infrastructure delivery

The important architectural idea is separation of concerns. Public access, authentication, routing, container execution, and service discovery are all handled by different layers. That keeps the platform easier to scale and much easier to evolve as the number of services grows.

The Core Pattern

At a high level, the platform follows this flow:

A client sends an HTTPS request to API Gateway.
API Gateway applies request-level controls such as API key enforcement, CORS behavior, and route matching.
The request is sent either to a Lambda-backed endpoint or to a private containerized service.
For ECS services, traffic goes through a VPC Link into internal load balancing.
The load balancer forwards the request to the correct ECS service based on path rules.
ECS Fargate runs one or more healthy tasks for that service and returns the response.

This gives a single API surface to consumers while allowing the backend implementation to vary by use case.

Why Combine Lambda and ECS?

A platform like this benefits from using both compute models rather than forcing every workload into one.

Lambda is a strong fit for:

lightweight request handlers
event-driven tasks
simple orchestration
platform support functions
endpoints that do not need a full container lifecycle

ECS Fargate is a better fit for:

long-lived HTTP microservices
containerized frameworks and dependencies
services that need more predictable runtime behavior
APIs that benefit from load balancing, health checks, and horizontal scaling

In our architecture, the design supports both. Some APIs are routed to Lambda-based services, while others are routed to ECS services defined through service configuration. That hybrid model is useful in real organizations because not all services have the same runtime needs.

A Three-Stage Infrastructure Model

One of the strongest ideas in our architecture is the staged Terraform layout. Instead of deploying everything together, the infrastructure is split into three layers.

Stage 1: Networking

The first stage establishes the network foundation:

VPC selection or creation
public and private subnet discovery or provisioning
internal Network Load Balancer
internal Application Load Balancer
VPC Link for API Gateway
ECS task security group
ALB log storage and network observability components

This stage is intentionally infrastructure-only. No application services are deployed here.

Stage 2: Compute

The second stage provisions the actual execution environment:

ECS cluster on Fargate
ECR repositories for service images
target groups per service
ALB listener and listener rules
ECS service definitions
CloudWatch log groups
Lambda functions used by the platform

This stage consumes outputs from the networking stage so the compute layer never hardcodes network assumptions in its own design.

Stage 3: API Gateways

The third stage exposes services through API Gateway:

a public API for internet-facing consumption
a private API for VPC-only access
route creation from service metadata
VPC Link integrations for containerized services
Lambda proxy integrations for Lambda-backed services
API keys, usage plans, and stage configuration

This split is operationally important. Teams can change routing without rebuilding networking, and they can add services without redesigning the entire platform.

The Request Path for ECS Services

For containerized microservices, the implementation follows a private ingress model.

The path is:

Client -> API Gateway -> VPC Link -> internal NLB -> internal ALB -> ECS service -> ECS task

That may look like one hop too many at first, but each layer has a purpose.

API Gateway

API Gateway is the public entry point. It handles:

TLS termination at the edge
route exposure
API key enforcement
request and header mapping
CORS handling
stage-based deployment

It gives consumers a stable API contract while keeping the backend private.

Why a VPC Link Is Used

ECS services are not exposed directly to the internet. Instead, API Gateway connects privately into the VPC using a VPC Link. That allows the public API layer to reach internal services without making the services themselves public.

This is a strong security pattern because the application runtime stays inside the VPC, but consumers still get a clean managed API endpoint.

Why the Repository Uses Both NLB and ALB

A useful implementation detail in our architecture is that the VPC Link targets an internal Network Load Balancer, and that NLB forwards to an internal Application Load Balancer.

This arrangement provides two separate benefits:

The NLB is used as the stable target for the API Gateway VPC Link.
The ALB performs path-based routing to the actual microservices.

The ALB is what makes many ECS services practical behind one internal entry point. Each service gets its own listener rule and target group, so the platform can route based on URL path rather than provisioning a separate load balancer per service.

How Load Balancing Works

The load-balancing model is service-oriented.

Each ECS microservice contributes:

a base API path
an ALB path pattern
a listener rule priority
a container port
a health check definition

From that metadata, Terraform creates:

one target group per service
one listener rule per service
one ECS service per service

This means the routing layer is not manually duplicated for every new microservice. The service declares its path and runtime settings, and the platform generates the infrastructure around it.
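The platform does this in Terraform; the Python sketch below only illustrates the shape of the derivation, with one service definition fanning out into three resource descriptions. All field names and the `-tg` naming convention are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    name: str
    path_pattern: str     # ALB path pattern
    priority: int         # listener rule priority
    container_port: int
    health_path: str      # health check definition


def plan_resources(service: ServiceConfig) -> dict:
    """Derive the three per-service resources from one service definition."""
    target_group_name = f"{service.name}-tg"  # hypothetical naming convention
    return {
        "target_group": {
            "name": target_group_name,
            "port": service.container_port,
            "target_type": "ip",  # Fargate tasks are registered by IP
            "health_check_path": service.health_path,
        },
        "listener_rule": {
            "priority": service.priority,
            "path_pattern": service.path_pattern,
            "forward_to": target_group_name,
        },
        "ecs_service": {
            "name": service.name,
            "container_port": service.container_port,
            "target_group": target_group_name,
        },
    }
```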

Target Groups

Each target group points to ECS tasks using IP targets. That is the correct choice for Fargate because tasks run with their own elastic network interfaces rather than on shared EC2 hosts.

The target groups in this repository also use application-level health checks. A task is considered healthy only when its service endpoint responds successfully on the configured health path.

That matters because container startup is not the same as application readiness. A service may be running from ECS's perspective but still not ready to receive traffic.

Listener Rules

The ALB listener is configured once, and each service gets a path-based rule. For example, a service under a quoting path can be matched independently from a service under a product-pricing path.

This keeps the routing layer centralized and avoids deploying a dedicated ALB per service, which would become expensive and operationally noisy as the platform grows.
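To make the evaluation order concrete, the sketch below mimics how path-based rules are matched: rules are checked from the lowest priority number upward, and the first match wins. The paths, priorities, and target names are hypothetical, and `fnmatchcase` is only an approximation of ALB wildcard matching (it also accepts bracket expressions that ALB does not).

```python
from fnmatch import fnmatchcase
from typing import Optional

# (priority, path pattern, target group) - hypothetical values
RULES = [
    (20, "/product-pricing/*", "product-pricing-tg"),
    (10, "/quoting/*", "quoting-tg"),
]


def route(path: str, rules=RULES) -> Optional[str]:
    """Return the target of the first matching rule, lowest priority number first."""
    for _, pattern, target in sorted(rules):
        if fnmatchcase(path, pattern):  # case-sensitive wildcard match
            return target
    return None  # would fall through to the listener's default action
```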

Health Checks and Traffic Protection

The repository uses health checks in multiple places:

API health endpoints at the application level
ALB target group health checks
ECS service health grace periods
container health checks inside the task definition

That layered approach improves resilience:

unhealthy tasks are removed from target groups
ECS replaces failed tasks
API Gateway continues to route through the same private entry point

The result is a platform that can recover from task-level failures without changing the public API contract.

How ECS Is Structured

The ECS side of the platform is built for repeatability rather than one-off service definitions.

ECS Cluster

The platform provisions a shared ECS cluster per environment. That allows multiple microservices to run within the same operational boundary while still being isolated at the task and service level.

The cluster uses Fargate, which removes the need to manage EC2 worker nodes. This simplifies operations significantly:

no patching of container hosts
no cluster capacity management at the instance level
easier scaling by task count

Reusable ECS Service Module

Instead of defining each ECS service from scratch, the repository uses a reusable Terraform module for service deployment.

That module is responsible for:

task definition creation
container logging configuration
IAM role wiring
ECS service creation
target group attachment
subnet and security group placement
optional capacity provider strategy

This is a strong platform choice. It makes service onboarding consistent and reduces drift between services.

Task Definitions

Each service runs as a Fargate task with:

a named container image from ECR
CPU and memory settings
environment variables
a health check command
CloudWatch logging

The repository also includes support for an additional X-Ray sidecar container in the task definition pattern, which is useful for distributed tracing in a microservice environment.

Network Mode

Tasks run with awsvpc networking, which gives each task its own network interface and private IP. This is the standard model for ECS on Fargate and is what allows ALB target groups to use IP mode cleanly.

Subnet and Security Group Design

This repository supports both existing/default VPC usage and a more segmented custom VPC model.

That flexibility matters because many teams start in a default-VPC or dev-friendly setup and later move to stricter network isolation for staging and production.

Subnet Placement

The network layer discovers public and private subnets where available. In a custom VPC, the design supports proper private subnet deployment. In a simpler default VPC setup, the platform can fall back to available public subnets when private ones are not present.

This is an important operational nuance:

development environments often optimize for simplicity
higher environments usually optimize for stricter isolation

The repository is built to handle both.

Security Groups

The security model follows least-privilege intent:

ECS tasks accept application traffic from the internal load-balancing layer
services are not directly internet-facing
API Gateway reaches backend services through private network integration

This keeps the application tier out of direct public exposure while still allowing a public API facade.

Config-Driven Service Onboarding

One of the most scalable ideas in our architecture is that services are registered through configuration rather than by handcrafting infrastructure every time.

There is a master service registry that lists enabled services per environment, and each service provides its own deployment metadata, including:

service identity
container port
desired task count
CPU and memory
API base path
ALB path pattern
listener priority
health check behavior
logging retention
autoscaling preferences

This creates a platform model rather than a collection of unrelated microservices.

Adding a new service becomes a repeatable process:

Create the service.
Define its configuration.
Register it in the service catalog.
Build and publish the image.
Apply Terraform stages.

That is much easier to maintain than cloning infrastructure blocks over and over.
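Because every service shares one listener, a registry check before applying Terraform can catch collisions early. The sketch below assumes hypothetical key names for the registry entries; the 1 to 50000 range is the priority range ALB accepts for listener rules.

```python
def validate_registry(services: list[dict]) -> list[str]:
    """Return configuration errors that would break path-based routing."""
    errors = []
    priority_owner: dict[int, str] = {}
    path_owner: dict[str, str] = {}
    for service in services:
        name = service["name"]
        priority = service["listener_priority"]
        path = service["alb_path_pattern"]
        if not 1 <= priority <= 50000:
            errors.append(f"{name}: priority {priority} is outside 1-50000")
        if priority in priority_owner:
            errors.append(f"{name}: priority {priority} already used by {priority_owner[priority]}")
        else:
            priority_owner[priority] = name
        if path in path_owner:
            errors.append(f"{name}: path {path} already used by {path_owner[path]}")
        else:
            path_owner[path] = name
    return errors
```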

Container Delivery with ECR

For ECS workloads, the container supply chain is straightforward:

Build the service image.
Push it to an ECR repository.
Reference the tagged image in the ECS task definition.
Update the ECS service to roll out the new task definition.

Our platform provisions one ECR repository per service, with image scanning enabled. That is a good baseline for a microservices platform because it keeps artifacts separated by service while still following a common naming convention.

There is also an explicit deployment phase between infrastructure provisioning and API exposure where container images are built and pushed. That is a practical real-world step many diagrams omit, but it is essential because ECS cannot run a service until the image exists in the registry.

How Lambda Fits into the Platform

Lambda is used here as a first-class platform option, not as an afterthought.

There are two useful Lambda patterns in our architecture.

1. Lambda as an API Backend

Some services can be exposed through API Gateway using Lambda proxy integration. This is ideal for capabilities that are naturally event-driven, lightweight, or operationally simpler as functions than as always-on containers.

In this model:

API Gateway owns the route
Lambda executes the business logic
API Gateway returns the Lambda response directly

This avoids unnecessary load-balancer and container overhead for smaller workloads.
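For reference, a minimal Python handler for a Lambda proxy integration might look like the sketch below. The greeting logic is purely illustrative; the parts that matter are the event fields read from the request and the response shape (`statusCode`, `headers`, and a string `body`) that API Gateway expects back.

```python
import json


def handler(event, context):
    """Minimal handler for an API Gateway Lambda proxy integration."""
    # queryStringParameters is null when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # The body must be a string, so the payload is JSON-encoded here.
        "body": json.dumps({"message": f"hello {name}", "path": event.get("path")}),
    }
```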

2. Lambda as a Platform Support Function

Our architecture also provisions Lambda functions that support the overall platform, such as authentication-related or onboarding-related workflows.

This is a smart use of Lambda in a hybrid platform because not every supporting concern needs to run inside ECS.

Authentication and API Protection

Our architecture clearly treats API protection as an API Gateway concern.

The current public API implementation enforces API key usage through API Gateway methods, API keys, and usage plans. The codebase also provisions a supporting API key validation Lambda function and related permissions, which shows the platform is designed to accommodate Lambda-based validation flows where needed.

The important architectural takeaway is this:

keep authentication and traffic governance at the gateway layer
keep service containers focused on business logic
keep private workloads private

That separation keeps the platform easier to secure and easier to reason about.

Public and Private API Models

Another strength of our architecture is that it supports both public and private APIs.

Public API

The public API is intended for internet-facing access. It handles:

external client access
API keys and usage plans
CORS behavior
Lambda and ECS route exposure

Private API

The private API is intended for internal or VPC-scoped access. It is useful when services should only be reachable from trusted network boundaries such as internal AWS workloads, integration environments, or enterprise connectivity paths.

This split is helpful when some capabilities should be public and others should remain internal even though they share the same service platform underneath.

Observability and Operations

A microservices platform is only as good as its operational visibility.

Our architecture includes observability at several levels:

CloudWatch log groups for ECS services
CloudWatch logs for Lambda functions
API Gateway stage logging
ALB logging support
VPC flow logging
X-Ray-friendly task patterns

That combination helps answer the most common production questions:

Did the request reach the gateway?
Was it routed to the right backend?
Was the target healthy?
Did the service fail or time out?
Was the problem in networking, routing, or application logic?

Without that layered visibility, hybrid platforms become difficult to troubleshoot.

Scaling Characteristics

This architecture scales well because each layer can evolve somewhat independently.

API Layer Scaling

API Gateway absorbs public traffic without requiring the backend to manage edge-facing concerns directly.

ECS Scaling

ECS services scale by task count. Each service can define:

desired count
minimum and maximum capacity
CPU and memory sizing
autoscaling thresholds

That means heavily used services can scale out without affecting lighter services.
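As a rough mental model of how these settings interact, the sketch below scales task count in proportion to how far a metric is from its target, then clamps the result to the configured bounds. This is a simplified proportional rule for illustration, not the algorithm AWS autoscaling actually applies, and the parameter names are hypothetical.

```python
import math


def next_task_count(current: int, metric: float, target: float,
                    minimum: int, maximum: int) -> int:
    """Propose a task count from a metric/target ratio, clamped to [minimum, maximum]."""
    if target <= 0:
        raise ValueError("target must be positive")
    proposed = math.ceil(current * metric / target)
    return max(minimum, min(maximum, proposed))
```

For example, 4 tasks at 90% average CPU against a 60% target would propose 6 tasks, while the minimum and maximum keep a quiet or overloaded service within its declared range.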

Platform Growth

As more services are added, the platform does not need a new ingress pattern each time. The same path-based routing model continues to work as long as route definitions and listener priorities stay clean.

Alignment with AWS Well-Architected Best Practices

This architecture also aligns well with AWS best-practice design principles, especially the pillars of the AWS Well-Architected Framework.

Operational Excellence

We have structured the platform so that it is operated as a system rather than as a collection of one-off deployments.

This is reflected in:

staged Terraform deployments for clearer ownership and safer changes
configuration-driven service onboarding
consistent ECS service patterns through reusable modules
standardized logging and deployment workflows

This reduces manual drift and makes operational changes more repeatable.

Security

Security is addressed through layered controls rather than a single protection point.

We have adhered to good AWS security practices by:

placing ECS services behind private networking rather than exposing them directly
using API Gateway as the controlled ingress layer
applying API-level protection at the gateway
using security groups to limit east-west traffic
supporting encrypted log and storage patterns
separating public access from internal service routing

This follows the AWS principles of strong boundaries, least privilege, and defense in depth.

Reliability

Reliability comes from designing for failure at the service and routing layers.

We have incorporated that through:

multi-AZ subnet placement
load balancer health checks
ECS task replacement behavior
target group isolation per service
decoupled gateway and backend layers
staged infrastructure dependencies with clear outputs between layers

This means a failing task or unhealthy target does not require the API surface itself to change.

Performance Efficiency

The architecture chooses the right compute model for the right workload.

That is an AWS best practice because it avoids treating all traffic the same.

Examples include:

Lambda for lighter, event-oriented, or supporting workflows
ECS Fargate for containerized services that need steady HTTP handling
ALB path-based routing for efficient multi-service consolidation
service-specific CPU, memory, and scaling settings

This lets us tune services independently instead of overprovisioning everything at the platform level.

Cost Optimization

Cost optimization is also visible in the design choices.

We are not multiplying infrastructure unnecessarily. Instead, the architecture encourages shared but controlled platform components:

one API layer for many services
one internal routing layer for many ECS workloads
shared ECS cluster patterns per environment
service-level scaling instead of blanket scaling
support for Fargate and optional capacity-provider strategies where appropriate

That is much closer to AWS best practice than provisioning separate ingress and compute stacks for every small service.

Sustainability and Maintainability

Even when sustainability is not called out directly, maintainable designs usually consume fewer engineering and infrastructure resources over time.

The architecture helps here by:

reducing duplicated infrastructure definitions
making service onboarding metadata-driven
encouraging reuse of shared platform components
keeping the public contract stable while backend services evolve

That leads to lower long-term complexity, which is a practical form of architectural efficiency.

Why This Pattern Works Well

This AWS pattern is effective because it balances standardization with flexibility.

It standardizes:

deployment stages
ingress architecture
service registration
load-balancer behavior
logging and health checks
ECS service creation

It stays flexible by allowing:

Lambda-backed endpoints
ECS-backed endpoints
public and private APIs
different service-level scaling and runtime settings
multiple environments with different networking strategies

That is exactly what a growing microservices platform needs.

Practical Implementation Advice

If you want to implement a similar architecture, a good sequence is:

Build the networking foundation first.
Keep all service backends private.
Put API Gateway in front of everything external.
Use ECS Fargate for containerized APIs that benefit from long-lived service behavior.
Use Lambda for support functions and lightweight endpoints.
Register services through metadata, not repetitive infrastructure definitions.
Use path-based ALB routing so many services can share one internal ingress layer.
Add strong health checks and centralized logs before traffic grows.

The key is not just choosing AWS services, but assigning each AWS service a clear responsibility.

Conclusion

Our architecture demonstrates a mature way to implement Lambda and ECS-based microservices through API Gateway without exposing backend services directly.

The architecture uses:

staged Terraform for separation of concerns
API Gateway as the public and private API facade
Lambda where serverless execution makes sense
ECS Fargate for containerized microservices
NLB and ALB together for private, path-aware routing
config-driven onboarding for scale

For teams building an enterprise microservices platform, this is a strong pattern because it supports security, operational clarity, and service growth without forcing every workload into the same runtime model.

Most importantly, it turns infrastructure into a reusable platform. Once that platform is in place, adding the next service becomes much easier than adding the first one.

Lessons Learned

Keeping API Gateway as the front door and backend services private makes the architecture easier to secure and easier to evolve.
Using both Lambda and ECS is more practical than forcing every use case into a single compute model.
Path-based routing through shared internal load balancing scales better than creating isolated ingress infrastructure for every service.
Service onboarding becomes significantly easier when routing, health checks, scaling, and runtime settings are driven by configuration.
Health checks, logging, and observability need to be designed from the beginning; adding them later is much harder in a distributed system.
A staged infrastructure model reduces operational risk because networking, compute, and API exposure can be changed independently.
Standardizing platform patterns early saves substantial effort as the number of microservices grows.

Building a Practical Lambda Capacity Provider Platform: Lessons Learned from Warm Pools, Version Hygiene, and CI/CD Reality

Amit Kayal — Mon, 20 Apr 2026 18:25:06 +0000

There is a big difference between a slide-deck architecture and a running system you can trust on a Monday morning.

This implementation captures that difference well. On paper, the idea is simple: create a shared AWS Lambda Managed Instances capacity provider, run latency-sensitive workloads on ARM64, keep the pool warm with EventBridge, prune old Lambda versions before they become operational debt, and wrap the whole thing in a GitHub Actions plus CodeBuild delivery model. In practice, each of those choices changes how you think about performance, cost, blast radius, and developer discipline.

What follows is not a generic cloud post. It is the kind of write-up you produce after actually building and living with the system.

The Real Problem We Were Solving

Traditional Lambda is excellent when you want abstraction and convenience. It becomes less elegant when your workload is sensitive to startup time, carries heavier dependencies, or needs more predictable execution behavior under bursty load.

That is where a Lambda capacity provider changes the discussion.

In this implementation, the platform is built around a shared aws_lambda_capacity_provider that uses ARM64 Graviton instances and auto scaling. The core idea is straightforward: instead of leaving execution placement entirely to the default Lambda fleet, we deliberately provide a managed compute pool that multiple functions can share. That gives us more control over cost-performance characteristics and lets us design around cold-start pain rather than merely complain about it.

The choice is visible in the Terraform:

The provider runs on arm64
Allowed instance types are constrained to m6g.large, m6g.xlarge, m7g.large, and m7g.xlarge
Scaling is set to Auto
The maximum pool ceiling is set to 64 vCPU
The capacity provider is placed in the default VPC, with unsupported Availability Zones filtered out

That last point matters more than it first appears. The code explicitly excludes unsupported AZs such as us-east-1e, which is a good example of operational maturity: the happy path is not enough when the service itself has placement constraints.

How We Actually Created the Capacity Provider

One thing I wanted this platform to avoid was "concept architecture" with no implementation backbone. So the capacity provider here is not described abstractly. It is provisioned directly in Terraform and wired into the Lambda lifecycle in a fairly intentional way.

The build starts in terraform_file/agent_core_sync_cp.tf.

First, the capacity provider itself is created with aws_lambda_capacity_provider. The naming pattern ties it to the service and environment, which is the right instinct for multi-environment operation. The provider is tagged as shared compute for agent workloads, which matters later for discoverability and platform governance.

Second, the provider is placed inside the default VPC, but not blindly. In terraform_file/data.tf, the code:

discovers the default VPC
fetches the default subnets
inspects subnet Availability Zones one by one
excludes unsupported zones such as us-east-1e
optionally caps how many subnets are used

This is a subtle but important design choice. Lambda Managed Instances often create one placement footprint per subnet or AZ. If you do not control subnet spread, you can end up creating more infrastructure surface area than you intended.
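The filtering itself lives in Terraform, but the logic is easy to show in Python. This is a minimal sketch under stated assumptions: select_subnets, the subnet IDs, and the cap of two subnets are all hypothetical, and only the excluded zone us-east-1e comes from the implementation.

```python
def select_subnets(subnets, unsupported_azs, max_subnets=None):
    """Pick subnet IDs whose AZ is supported, optionally capping the count.

    `subnets` is a list of dicts with "subnet_id" and "availability_zone" keys.
    """
    supported = [s for s in subnets if s["availability_zone"] not in unsupported_azs]
    # Sort so repeated runs pick the same subnets and plans stay stable.
    supported.sort(key=lambda s: (s["availability_zone"], s["subnet_id"]))
    if max_subnets is not None:
        supported = supported[:max_subnets]
    return [s["subnet_id"] for s in supported]


# Hypothetical default-VPC subnets.
subnets = [
    {"subnet_id": "subnet-aaa", "availability_zone": "us-east-1a"},
    {"subnet_id": "subnet-eee", "availability_zone": "us-east-1e"},
    {"subnet_id": "subnet-bbb", "availability_zone": "us-east-1b"},
    {"subnet_id": "subnet-ccc", "availability_zone": "us-east-1c"},
]

chosen = select_subnets(subnets, unsupported_azs={"us-east-1e"}, max_subnets=2)
print(chosen)
```

The sort before the cap is the part worth copying: without a stable order, the set of chosen subnets can change between runs even when nothing in the VPC changed.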

Third, the provider uses a dedicated security group rather than inheriting something vague and accidental. The current implementation keeps outbound traffic fully open and allows inbound HTTPS. That is permissive, but it is at least explicit and repeatable. Early-stage platforms benefit from that kind of clarity.

Fourth, the capacity provider gets its own operator role through AWSLambdaManagedEC2ResourceOperator. That is a critical detail. Capacity providers are not just Lambda resources; they need AWS to manage the EC2-backed execution infrastructure on your behalf. If you miss that role, the platform does not really exist no matter how nice your Terraform looks.

Fifth, the instance requirements are opinionated. The code forces arm64 and narrows the fleet to supported Graviton M-family instance types. That is one of the better engineering decisions in this implementation because it converts an architectural preference into an enforceable runtime rule.

Finally, the Lambda function is attached to the capacity provider in terraform_file/lambda_clm_router_agent.tf through capacity_provider_config. That is where the abstraction becomes real. We are not just provisioning a pool and hoping someone uses it later. We are explicitly binding a published Lambda to that pool and tuning:

memory GiB per vCPU
max concurrency per execution environment
ARM64 runtime alignment
published versioning through Lambda aliases

That is the full loop: provision shared compute, constrain placement, grant AWS the operator role it needs, attach live functions to the pool, and then manage the resulting version sprawl with automation. That is what makes this feel like a platform artifact rather than a loose Terraform experiment.

Lesson 1: A Capacity Provider Is Not a Tuning Knob. It Is an Operating Model.

Teams often talk about capacity providers as if they are just a performance optimization. That framing is too shallow.

The moment you move Lambda onto managed instances, you are no longer only buying faster startup. You are adopting a new operating model with very clear implications:

You now care about instance family compatibility
You need to think about subnet strategy and AZ support
You have to reason about pool scaling ceilings, concurrency, and memory per vCPU
You are effectively blending serverless ergonomics with infrastructure accountability

This implementation shows that transition clearly. The CLM router Lambda is not just declared with a runtime and handler. It is attached to the shared capacity provider and explicitly tuned with:

execution_environment_memory_gib_per_vcpu
per_execution_environment_max_concurrency
publish = true
architectures = ["arm64"]

That is the tell. Once we start specifying how execution environments should behave, we are no longer simply "deploying a Lambda." We are shaping compute economics.

The practical lesson here is simple: if you adopt Lambda Managed Instances, treat it like platform engineering, not like a runtime checkbox.

Lesson 2: ARM64 Delivers Real Value, but Only if You Respect Service Constraints

One of the strongest decisions in this implementation is the bias toward Graviton. For Python-heavy agent workloads, ARM64 is usually the right default. The economics are better, and the performance-per-dollar story is often compelling.

But there is an important nuance that the Terraform comments correctly capture: not every EC2 family you might expect is supported in the way you assume. This implementation explicitly avoids unsupported combinations and narrows the fleet to supported M-family Graviton instances.

That is a good lesson in cloud architecture generally: cloud products market flexibility, but production systems survive on constraint management.

The teams that do well with modern AWS services are not the ones that assume every SKU works. They are the ones that encode the service's real boundaries in Terraform so no one has to rediscover them during an incident window.

Lesson 3: Warmup Is Not a Hack. It Is a Deliberate Control Loop.

There is a tendency in engineering circles to treat "warming" as a slightly embarrassing workaround. I think that is the wrong mindset.

This implementation schedules the CLM router Lambda every five minutes through EventBridge. The handler itself is intentionally lightweight and effectively acts as a keep-alive mechanism. That is not laziness. It is an explicit decision to keep the shared pool alive for latency-sensitive traffic.

More specifically, the warmer exists to reduce the probability that the capacity provider has to spin up fresh managed instance capacity for a new invocation path after a quiet period. That is the practical point of the EventBridge rule in terraform_file/eventbridge_cp_arm.tf. By invoking the Lambda on a steady rate(5 minutes) schedule, the platform keeps the execution path warm enough that the shared capacity provider is less likely to fall all the way back to a cold, scale-from-zero posture right before a real request arrives.
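To make the keep-alive idea concrete, here is a minimal sketch of a handler that answers warm-up pings cheaply. It is an assumption about how such a handler could be written, not the router's actual code: route_request is a hypothetical placeholder, and the early return is my illustration. The one fact it relies on is that EventBridge scheduled events arrive with source set to aws.events and detail-type set to Scheduled Event.

```python
def is_warmup_event(event):
    """True for EventBridge scheduled events, which carry these two fields."""
    return (
        isinstance(event, dict)
        and event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )


def route_request(event):
    """Hypothetical placeholder for the real routing logic."""
    return {"statusCode": 200, "body": f"routed: {event.get('path', '/')}"}


def handler(event, context):
    # Return early on warm-up pings so they stay cheap and skip business logic.
    if is_warmup_event(event):
        return {"statusCode": 200, "body": "warm"}
    return route_request(event)
```

The ping still initializes the runtime and any module-level imports, which is the work a real first request would otherwise pay for.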

The important insight is this: once you care about cold-start predictability, you need a control loop.

That control loop can be:

Provisioned concurrency
Scheduled warmers
Request shaping
A shared managed instance pool

In this design, the team chose scheduled warm invocation plus a shared capacity provider. That is a sensible middle ground. It is cheaper and simpler than overcommitting always-on infrastructure, while still materially reducing the first-hit penalty.

In plain English: the EventBridge warmer is being used here so the capacity provider does not need to spin up a brand-new server footprint every time traffic reappears after idle time. For interactive or latency-sensitive agent workloads, that is a very practical optimization.

The strategic lesson is that warmup should be measured against business latency, not ideological purity. If a five-minute EventBridge schedule protects user experience and keeps cost acceptable, it is doing its job.

Lesson 4: Shared Pools Create Efficiency, but They Also Create Coupling

The capacity provider here is intentionally shared across platform agents and automation services. That is the right move early in a platform journey because it improves utilization and prevents every Lambda from inventing its own isolated infrastructure story.

But shared pools always introduce two forms of coupling:

Technical coupling, because multiple workloads compete for the same execution substrate
Organizational coupling, because one team's deployment patterns can affect another team's cost and performance envelope

That is why the concurrency controls here matter. The CLM router function uses a per-execution-environment concurrency setting, and the environment-specific .tfvars files pin that concurrency to 4. That is more than a performance number. It is a fairness policy.

If I were advising a platform team scaling this pattern, I would say this clearly: shared capacity providers are excellent, but they need quota thinking from day one. Otherwise the first successful workload becomes the first noisy neighbor.

Lesson 5: If You Publish Versions Aggressively, You Need Lifecycle Hygiene on Day One

This implementation makes another good call: the Lambda functions are published, aliased, and then cleaned up with an automated version pruner.

That matters because version sprawl is one of those quiet operational problems that teams ignore until it becomes annoying enough to disrupt deployments. Published versions accumulate quickly when CI/CD is active. If you do not manage them, you eventually pay in clutter, confusion, or hard service limits.

The lambda_version_pruner implementation is stronger than a simplistic cleanup script because it preserves what actually matters:

It scans all Lambda functions
It filters only functions associated with the target capacity provider
It lists all aliases and protects aliased versions
It keeps the latest N published versions
It deletes everything older that is neither current nor aliased

This is exactly the kind of automation mature teams invest in. Not glamorous. Very valuable.
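The selection rule at the heart of the pruner can be sketched as a pure function. This is a simplified illustration, not the lambda_version_pruner source: versions_to_delete and its parameter names are hypothetical. In a real pruner the inputs would presumably come from Lambda API calls such as boto3's list_versions_by_function and list_aliases.

```python
def versions_to_delete(published_versions, aliased_versions, keep_latest):
    """Return published versions that are safe to delete, oldest first.

    published_versions: numeric version strings such as "7" ("$LATEST" is ignored).
    aliased_versions:   versions referenced by any alias; these are never deleted.
    keep_latest:        how many of the newest published versions to retain.
    """
    numbered = sorted(int(v) for v in published_versions if v != "$LATEST")
    # Guard keep_latest == 0, because numbered[-0:] would select every version.
    newest = set(numbered[-keep_latest:]) if keep_latest > 0 else set()
    protected = newest | {int(v) for v in aliased_versions if v != "$LATEST"}
    return [str(v) for v in numbered if v not in protected]


doomed = versions_to_delete(
    published_versions=["$LATEST", "1", "2", "3", "4", "5", "6"],
    aliased_versions={"2"},  # e.g. a rollback alias still points at version 2
    keep_latest=3,
)
print(doomed)
```

Here versions 4 to 6 survive as the newest three, version 2 survives because an alias points at it, and only versions 1 and 3 are selected. The sketch ignores weighted alias routing, where an alias can reference a second version that would also need protecting.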

There is also an understated platform principle here: rollback is not just about keeping artifacts. It is about keeping the right artifacts. By preserving aliased versions, the pruner respects deployment intent rather than blindly optimizing for tidiness.

There is also a more practical capacity-provider reason for doing this, and it deserves to be stated directly.

When you run a shared Lambda Managed Instances pool, you want the platform to spend its effort on the versions that are actually serving traffic, warming correctly, or remaining available for safe rollback. If old published versions keep accumulating forever, three unhealthy things tend to happen:

operators lose clarity on which versions are still meaningful
rollback and alias management become noisier than they should be
the shared platform carries more deployment residue than useful runtime intent

Strictly speaking, deleting old Lambda versions does not add CPU to the capacity provider. What it does is improve platform hygiene around the shared pool: the versions attached to aliases, warmup patterns, and deployment workflows remain deliberate and limited. In other words, the benefit is indirect. Pruning reduces version sprawl around the workloads that consume the shared capacity; it does not change the capacity itself.

That matters in real operations. The healthier the deployment surface is, the easier it is to reason about what is warming, what is active, what can be rolled back, and what should no longer influence the platform at all.

So the version pruner is not just a cleanup utility. It is part of making the shared capacity provider operationally efficient. Not by adding raw compute, but by reducing noise, protecting the versions that matter, and keeping the platform focused on live execution paths instead of historical leftovers.

Lesson 6: GitHub Actions Should Orchestrate. CodeBuild Should Execute.

Architecturally, the CI/CD model here is sensible.

GitHub Actions is used as the control plane:

branch-based triggering
security scanning
environment selection
AWS credential injection
build orchestration

AWS CodeBuild is used as the execution plane:

Terraform install
terraform init
terraform validate
terraform plan
terraform apply

I like this split. It keeps GitHub Actions lightweight and makes AWS the place where the actual infrastructure mutation happens. That usually gives better access control, cleaner auditability, and fewer surprises around long-running plan or apply steps.

The buildspecs pin Terraform 1.12.2, install the CLI explicitly, and then execute plan/apply flows with environment-specific variable files. That is exactly the kind of boring repeatability you want in infrastructure delivery.

This is one of the most practical lessons from the implementation: do not force GitHub Actions to be your full deployment runtime if AWS-native execution gives you better control.

Lesson 7: CI/CD Maturity Is Not About Having a Pipeline. It Is About Where the Gates Actually Are.

The implementation also reveals a harder truth: CI/CD design is won or lost not by YAML volume, but by trigger discipline.

There are some good instincts here:

Dev deployment is chained off a successful security workflow
Security scanning runs on push and PR for dev
PR security review is scoped only to actual code and infrastructure changes
Environment-specific secrets are used for AWS access

That said, the current implementation also shows the kinds of issues every fast-moving team encounters:

The dev deploy workflow is triggered by Security Checks (Push), not by a broader quality gate such as tests plus security plus static analysis
The QA workflow is currently triggered on pull_request to qa, yet it also includes an apply stage, which is a risky combination
The sanity workflow references a different CodeBuild project naming pattern, which looks like copy-forward drift from another implementation
One dev apply step mixes generic and environment-specific secrets in a way that deserves tightening

This is not a criticism of the team. It is actually the most authentic part of the system.

Real pipelines evolve through reuse, renaming, urgency, and partial migration. The useful engineering habit is not pretending they are pristine. It is recognizing that pipeline drift is itself a production concern.

My blunt lesson here is this: CI/CD is software. It needs the same review rigor as application code.
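Treating pipelines as software also means they can be linted. The sketch below shows one crude check for the riskiest pattern above, a workflow that triggers on pull_request yet contains an apply step. Everything in it is hypothetical: the function names, the example workflow, and the start_build.sh script are mine, and the workflow is assumed to be already parsed into a Python dict.

```python
def triggers_on_pull_request(workflow):
    """GitHub Actions allows `on` to be a string, a list, or a mapping."""
    trigger = workflow.get("on", {})
    if isinstance(trigger, str):
        return trigger == "pull_request"
    return "pull_request" in trigger  # membership works for lists and dicts


def apply_steps(workflow):
    """Yield (job, step label) for steps whose name or command mentions 'apply'."""
    for job_name, job in workflow.get("jobs", {}).items():
        for step in job.get("steps", []):
            label = step.get("name") or step.get("run", "")
            text = f"{step.get('name', '')} {step.get('run', '')}".lower()
            if "apply" in text:
                yield job_name, label


def risky_apply_steps(workflow):
    """Apply steps are only flagged when the workflow runs on pull requests."""
    if not triggers_on_pull_request(workflow):
        return []
    return list(apply_steps(workflow))


# Hypothetical QA workflow that plans and applies on a pull request.
qa_workflow = {
    "on": {"pull_request": {"branches": ["qa"]}},
    "jobs": {
        "terraform": {
            "steps": [
                {"name": "Terraform plan", "run": "./start_build.sh plan"},
                {"name": "Terraform apply", "run": "./start_build.sh apply"},
            ]
        }
    },
}

findings = risky_apply_steps(qa_workflow)
print(findings)
```

A string match on "apply" will produce false positives, so a check like this is better as a review prompt than as a hard gate.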

Lesson 8: Documentation Drift Is a Reliability Signal

The README here is ambitious and useful, but parts of it clearly describe a broader or earlier architecture than the exact files currently present. That mismatch is more important than most teams realize.

When documentation and implementation diverge, three things happen:

new engineers learn the wrong system
reviewers approve changes with outdated mental models
incidents take longer to resolve because operators trust stale diagrams

One of the best engineering habits is to treat documentation drift as an operational bug, not as a cosmetic issue.

This implementation makes that case well. The code is the source of truth. The docs are directionally strong, but some names, workflow descriptions, and file references have clearly moved over time. That is normal. What matters is catching it before the next engineer builds decisions on old assumptions.

Lesson 9: The Default VPC Is Fine for Speed, but It Should Be a Conscious Temporary Convenience

The Terraform intentionally uses the default VPC and default subnets, then layers in filtering and a custom security group. For early velocity, that is an acceptable choice. It removes friction and makes the first deployment much easier.

But teams should be honest about the tradeoff.

Using the default VPC accelerates setup. It does not provide the same clarity, segmentation, or policy hygiene that a dedicated workload VPC eventually should. The inbound HTTPS rule from 0.0.0.0/0 is another example of where a practical early-stage decision should later be revisited with a more opinionated security posture.

My view is simple: default VPC usage is fine when it is a speed decision. It becomes dangerous when it silently hardens into architecture.

Lesson 10: Least Privilege Usually Loses the First Battle. Do Not Let It Lose the War.

The Lambda IAM policy for the router function is broad. Very broad.

That is common when a platform team is trying to unblock integration work quickly across S3, SQS, SNS, DynamoDB, Bedrock, AppSync, logs, X-Ray, and secrets. The version pruner is noticeably tighter, which is encouraging. But the broader pattern remains familiar: the first version of a system usually over-grants.

The lesson is not "never do that." The lesson is "know when you are doing it, and schedule the hardening work while the platform is still comprehensible."

Security debt compounds. The longer a wide-open policy survives, the more invisible it becomes.

What This Repo Gets Right

If I strip away the drift and focus on the platform instincts, this implementation gets a lot right:

It treats capacity provider infrastructure as shared platform capability, not one-off function plumbing
It optimizes for ARM64 economics instead of defaulting to x86 out of habit
It acknowledges cold starts as a business problem and addresses them operationally
It preserves rollback safety with aliases while still pruning version sprawl
It separates orchestration from execution in CI/CD
It encodes AWS service constraints in Terraform comments and defaults, which reduces tribal knowledge

That is a strong foundation.

What I Would Improve Next

If I were turning this into the next version of a production-grade internal platform, I would prioritize the following:

Tighten naming consistency across the implementation. The capacity provider name appears in slightly different forms across resources. That is how automation misses its target. Shared naming locals should eliminate this class of error.
Make QA and production promotion rules stricter. A PR-triggered apply path should be removed. Planning on PR and applying only from a protected branch or an approved environment gate is the cleaner model.
Run Terraform from a single explicit working directory. The current layout places Terraform under terraform_file/, while some buildspec commands read like root-level execution. That ambiguity should be eliminated.
Move from broad IAM toward intent-based policies. Especially for the router Lambda, policy scope should narrow as the workload stabilizes.
Revisit networking posture. The default VPC is fine for speed; a dedicated VPC model is better for longevity, auditability, and controlled ingress.
Add stronger deployment quality gates. Security review is useful, but infrastructure promotion should also hang off validation, tests, linting, and explicit approval where appropriate.
Add platform observability as code. CloudWatch alarms, dashboards, and cost visibility for the capacity provider should be treated as first-class Terraform resources, not follow-up tasks.

The Bigger Technical Lesson

The biggest takeaway from this implementation is not about Lambda specifically.

It is about how modern platform teams should build.

We should absolutely chase better cost-performance curves. We should use managed primitives aggressively. We should automate the boring work. But we also need the discipline to encode what we learn while the system is still small enough to reason about.

What makes this useful is that it shows both halves of real engineering:

the architectural intent
the implementation scars

That combination is where credible engineering judgment comes from.

Anyone can present a clean target state. The harder and more useful skill is building systems that survive contact with deployment friction, service constraints, naming drift, and operational reality.

That is what this implementation is doing. And that is why the lessons here matter.

Closing Thought

Capacity providers, warmers, version pruning, and GitHub-driven delivery are not separate topics. They are all answers to the same technical question:

How do we make cloud systems faster, cheaper, safer, and more repeatable without turning every application team into a specialized infrastructure group?

In this implementation, the answer was to centralize the hard platform decisions, automate the hygiene, keep the runtime warm where it matters, and stay honest about the places where the system still needs tightening.

That is not just good infrastructure work.

That is good engineering practice.

Lessons I learned building a memory-aware agent with Amazon Bedrock AgentCore Runtime

Amit Kayal — Mon, 20 Apr 2026 18:10:16 +0000

When I started building an agent with Amazon Bedrock AgentCore Runtime, I thought the difficult parts would be model selection, tool wiring, and deployment. Those certainly mattered, but the part that shaped the quality of the agent most was memory.

The first version of the agent could answer single prompts well enough, but it did not behave like a real multi-turn system. Follow-up questions were brittle. The agent lost short-range intent. Tool usage worked, but only within the narrow boundaries of the current prompt. As soon as the conversation depended on what happened one or two turns earlier, the system started to feel less like an agent and more like a stateless inference endpoint.

That experience changed how I approached the design. I stopped thinking about memory as a convenience feature and started treating it as part of the runtime architecture itself. This article is a distillation of the most important lessons I learned while building a short-term-memory-aware agent with Amazon Bedrock AgentCore Runtime and Strands.

Lesson 1: An agent is not really multi-turn until memory is part of the lifecycle

One of the first things I learned is that conversational continuity does not emerge automatically just because the application calls the same runtime repeatedly.

Without short-term memory, the agent only sees the current prompt unless the application keeps reconstructing and replaying history manually. That creates several problems:

previous instructions are easy to lose,
tool chains become fragile across turns,
users have to restate identifiers and intent,
the system becomes increasingly prompt-shaped rather than interaction-shaped.

What became clear to me is that short-term memory is not about storing everything forever. It is about preserving enough recent state for the current conversation to remain coherent.

That distinction matters. I was not trying to build a knowledge base or semantic fact store. I was trying to answer a simpler question: how do I help the agent remember what we were just doing?

Once I framed the problem that way, the architecture became much clearer.

Lesson 2: The cleanest pattern is explicit memory, not implicit transcript magic

Another lesson I learned quickly is that I did not want memory to be hidden behind vague runtime behavior. I wanted the agent code to make memory use explicit:

where memory comes from,
when it is read,
when it is written,
which user it belongs to,
which conversation it belongs to.

That led me to a pattern built around MemoryClient and hooks.

Instead of treating memory like a passive transcript that somehow appears at the edge of the request, I found it much more reliable to think about it as a lifecycle-managed dependency:

create a short-term memory resource,
pass the memory identity into the runtime,
read recent turns when the agent initializes,
write new messages as events when the conversation changes.

The important shift for me was this: memory worked best when it was part of the agent object model, not just part of request handling glue code.

Lesson 3: Hooks are where memory belongs

This was probably the biggest implementation insight.

Once I had a Strands-based agent running inside AgentCore Runtime, I needed to decide where the memory logic should live. I could have put everything directly into the entrypoint and manually stitched together request parsing, history retrieval, message persistence, and prompt injection. That would have worked, but it would have made the agent lifecycle harder to reason about.

What worked better was using hooks tied to the agent lifecycle itself:

AgentInitializedEvent
MessageAddedEvent

That structure gave me a much cleaner mental model.

On initialization, the agent needs context before it reasons. That is the right moment to retrieve the most recent turns from memory and inject them into prompt context.

When a new message is added, the conversation state has changed. That is the right moment to persist the latest user or assistant message back into memory.

The core interaction looks like this:

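# On AgentInitializedEvent: load the last five turns for this user and session.
recent = memory_client.get_last_k_turns(
    memory_id=memory_id,
    actor_id=actor_id,
    session_id=session_id,
    k=5,
)

# On MessageAddedEvent: persist the new message under the same identifiers.
memory_client.create_event(
    memory_id=memory_id,
    actor_id=actor_id,
    session_id=session_id,
    messages=[(text, role)],
)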

What I like about this model is that it is deterministic.

memory load happens before reasoning,
memory write happens when conversation state changes,
both operations use the same identity boundaries,
the entrypoint stays focused on request extraction rather than conversation orchestration.

That made the system easier to debug, easier to extend, and much easier to explain.
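To show the shape of this pattern without depending on the real SDKs, here is a self-contained sketch. It is not the Strands or AgentCore API: InMemoryMemoryClient is a stand-in that mirrors only the two calls shown above, and the ShortTermMemoryHooks class with its method names is hypothetical. In a real agent, the two handlers would be registered against AgentInitializedEvent and MessageAddedEvent.

```python
from collections import defaultdict


class InMemoryMemoryClient:
    """Stand-in for the real memory client, for illustration only."""

    def __init__(self):
        self._events = defaultdict(list)

    def create_event(self, memory_id, actor_id, session_id, messages):
        self._events[(memory_id, actor_id, session_id)].extend(messages)

    def get_last_k_turns(self, memory_id, actor_id, session_id, k):
        # Simplification: each stored message counts as one "turn".
        return self._events[(memory_id, actor_id, session_id)][-k:]


class ShortTermMemoryHooks:
    """Hypothetical hook handlers; a real agent framework would invoke these."""

    def __init__(self, client, memory_id, actor_id, session_id):
        self.client = client
        self.scope = dict(memory_id=memory_id, actor_id=actor_id, session_id=session_id)

    def on_agent_initialized(self, system_prompt):
        # Load recent turns before the agent reasons and fold them into the prompt.
        recent = self.client.get_last_k_turns(k=5, **self.scope)
        if not recent:
            return system_prompt
        history = "\n".join(f"{role}: {text}" for text, role in recent)
        return f"{system_prompt}\n\nRecent conversation:\n{history}"

    def on_message_added(self, text, role):
        # Persist each new message as soon as the conversation state changes.
        self.client.create_event(messages=[(text, role)], **self.scope)


client = InMemoryMemoryClient()
hooks = ShortTermMemoryHooks(client, "mem-1", "user-42", "session-a")
hooks.on_message_added("Where is order 1001?", "USER")
hooks.on_message_added("Order 1001 shipped yesterday.", "ASSISTANT")
prompt = hooks.on_agent_initialized("You are a support agent.")
print(prompt)
```

Both handlers share one scope dictionary, so a read and a write can never disagree about which user or conversation they belong to.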

Lesson 4: Identity is the real memory boundary

Before building this, I thought of memory mostly as a storage problem. In practice, I learned it is just as much an identity problem.

The two identifiers that mattered most were:

actor_id
session_id

This separation ended up being foundational.

Why `actor_id` matters

actor_id is the user boundary. If that identifier is unstable, absent, or inconsistent, memory quality degrades immediately.

What I learned is that a memory system is only as good as the application identity you feed into it. If the same user appears under multiple IDs, the agent cannot retrieve a coherent conversational history. If two users are accidentally mapped to the same identity, memory becomes unsafe.

So one of my strongest takeaways is that actor_id should always come from a stable authenticated user identity, not from an incidental client-generated value.

Why `session_id` matters

session_id turned out to be just as important. A single user does not have just one conversation. They may have multiple active threads:

one troubleshooting flow,
one transcript analysis request,
one abandoned conversation from earlier,
one brand-new task.

Without a session boundary, all of that collapses into one memory stream. The agent might technically “remember,” but it remembers too much of the wrong thing.

That was a key lesson for me: useful memory is not just preserved memory. It is correctly scoped memory.
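One way to make that scoping hard to get wrong is to turn the identity pair into a value that refuses to exist when either half is missing. This is a rough sketch, not code from the agent itself: `MemoryScope` and `scope_from_request` are hypothetical names, and reading `session_id` from a request payload is an assumption about how the client picks a thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryScope:
    """Hypothetical value object: the (actor_id, session_id) pair that keys memory."""

    actor_id: str
    session_id: str

    def __post_init__(self):
        # Refuse to build a scope from missing or blank identifiers, so a bad
        # request fails loudly instead of reading someone else's history.
        for name in ("actor_id", "session_id"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")


def scope_from_request(authenticated_user_id, payload):
    # actor_id comes from the authenticated identity, never from the payload;
    # session_id is supplied by the client to pick the conversation thread.
    return MemoryScope(actor_id=authenticated_user_id, session_id=payload.get("session_id"))
```

Because the dataclass is frozen, a scope is hashable and can be used directly as a dictionary key, so two sessions of the same user never share an entry.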

Lesson 5: The agent should be rebuilt per request, but memory should persist across requests

This was an architectural point that became clearer as I implemented the runtime flow.

The Strands agent instance itself is created per request. That makes sense because each invocation carries request-specific state:

the current user prompt,
the active user identity,
the active conversation session,
the active tool and runtime context.

But memory should not behave like request-local state. Memory has to outlive the agent instance and remain keyed to the same user and conversation across invocations.

That split was important for me to internalize:

agent instance lifecycle is short,
conversation memory lifecycle is longer,
the link between them is established through state and hooks.

Once I started thinking in those terms, the design felt much more natural.

Lesson 6: Deployment is part of the memory design

I originally thought of deployment as a separate concern from conversational behavior. Building this agent convinced me that the two are tightly connected.

The runtime needs to know which memory resource it should use, but I did not want that decision hardcoded in application logic. The better pattern was to resolve the correct memory resource during deployment and pass that identity into the runtime as configuration.

In practice, that meant the runtime received environment-specific values such as:

AGENT_NAME=<agent-name>
MEMORY_ID=<memory-id>

That gave me a few benefits immediately:

the same application code could move across environments,
memory resources stayed aligned with environment boundaries,
the runtime remained configurable without source changes,
the control plane remained the primary place where resource binding happened.

One of the clearest lessons here is that memory should be treated like any other environment-bound infrastructure dependency. If it is not part of deployment, it tends to become a hidden assumption.
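On the runtime side, reading that binding can be as small as the sketch below. `RuntimeConfig` and `load_runtime_config` are hypothetical names, and treating a missing `MEMORY_ID` as "memory disabled" rather than a hard error is an assumption you may want to reverse in stricter environments.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RuntimeConfig:
    agent_name: str
    memory_id: Optional[str]  # None means the agent runs stateless

    @property
    def memory_enabled(self) -> bool:
        return self.memory_id is not None


def load_runtime_config(env: Mapping[str, str] = os.environ) -> RuntimeConfig:
    # AGENT_NAME is required; fail fast if deployment did not set it.
    agent_name = env.get("AGENT_NAME", "").strip()
    if not agent_name:
        raise RuntimeError("AGENT_NAME is not set")
    # MEMORY_ID is optional here, but an empty value is normalized to None so
    # callers can log "memory disabled" instead of silently going stateless.
    memory_id = env.get("MEMORY_ID", "").strip() or None
    return RuntimeConfig(agent_name=agent_name, memory_id=memory_id)
```

Passing the environment in as a mapping keeps the function testable without touching the real process environment.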

Lesson 7: Short-term memory and long-term memory solve different problems

I found it helpful to stop using the word “memory” as if it meant one thing.

Short-term memory answers one question:

"What was happening in this conversation recently?"

Long-term memory answers a different question:

"What durable information should the system remember beyond this immediate interaction?"

For the agent I was building, the short-term problem came first. I needed:

recent-turn continuity,
bounded replay,
session-scoped context,
predictable event retention.

I did not need semantic fact retrieval in the first phase. I did not need vector search for historical knowledge. I needed the agent to remain coherent across adjacent turns.

That was an important design simplification. It kept the first version of the memory architecture focused on event continuity instead of overextending into knowledge retrieval prematurely.

Lesson 8: Recent-turn replay should be bounded

Once I had memory retrieval working, the next question was how much of it to inject back into the agent context.

My lesson here was simple: more memory is not always better memory.

If too much prior conversation is replayed:

prompt size grows,
token cost grows,
stale context starts competing with the current task,
reasoning quality can actually decline.

I found the most practical pattern was to retrieve the last few turns and inject them into prompt context in a compact representation. In this design, that replay window was bounded at five turns.

That gave me a good balance:

enough recent context for continuity,
small enough context for predictable prompt growth,
simple enough formatting to inspect and debug.

This also reinforced another lesson: short-term memory should be operationally understandable. I want to know what context the model saw, not just trust that some opaque memory layer handled it correctly.
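A bounded, inspectable replay can be a few lines of plain string handling. The sketch below assumes turns arrive as `(text, role)` tuples, oldest first; `format_recent_turns` and the 500-character cap are illustrative choices of mine, not values from the original build.

```python
def format_recent_turns(turns, max_turns=5, max_chars_per_message=500):
    """Render the newest `max_turns` (text, role) pairs as compact prompt lines."""
    if max_turns <= 0:
        return ""
    lines = []
    for text, role in turns[-max_turns:]:
        # Collapse whitespace and cap length so one long message cannot
        # dominate the prompt.
        compact = " ".join(text.split())
        if len(compact) > max_chars_per_message:
            compact = compact[:max_chars_per_message] + "..."
        lines.append(f"{role}: {compact}")
    return "\n".join(lines)
```

Because the output is a plain string, it can be logged next to the request, which answers the question of what context the model actually saw.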

Lesson 9: Memory becomes more valuable when tools are involved

The agent I built was not just a conversational shell. It had tools, including domain-specific behavior such as transcript retrieval and AWS interactions.

That is where the value of short-term memory became even more obvious.

In a tool-using workflow, the user often does not repeat the full context every turn. They say things like:

"use the same meeting"
"what did the second speaker say?"
"now summarize that"
"check the S3 output from before"

Without memory, the agent has to reconstruct working state from a single prompt. With memory, the agent has a much better chance of preserving:

the active object under discussion,
the prior user instruction,
the last tool result,
the intended next step.

One of my strongest takeaways is that memory is not just a conversational improvement. It is a workflow improvement. It makes tool orchestration across turns materially more coherent.

Lesson 10: Failure modes need to be designed, not discovered in production

Building this also made me think much more carefully about degraded behavior.

If memory resolution fails and the runtime cannot find a memory resource, the agent may still run. That sounds convenient, but it also means the system may silently shift from stateful to stateless behavior.

That taught me to treat the following as first-class operational conditions:

memory enabled,
memory disabled,
memory load succeeded,
memory write succeeded,
memory resolution failed,
identity inputs were missing or malformed.

The same thing applies to identity mistakes.

If actor_id is unstable, memory becomes fragmented.

If session_id is reused incorrectly, unrelated conversations bleed into each other.

If replay windows grow without discipline, prompt quality degrades.

These are not edge cases. They are part of the normal operating surface of a memory-aware agent.
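To keep those conditions visible rather than implicit, the load path can return an explicit status next to the data. This is a sketch under assumptions: `MemoryStatus`, `load_memory`, and the `fetch_turns` callable are hypothetical, and choosing to degrade to an empty history instead of failing the request is a design decision you may not share.

```python
import logging
from enum import Enum

logger = logging.getLogger("agent.memory")


class MemoryStatus(Enum):
    DISABLED = "memory_disabled"
    IDENTITY_INVALID = "identity_missing_or_malformed"
    LOAD_FAILED = "memory_load_failed"
    LOADED = "memory_load_succeeded"


def load_memory(fetch_turns, memory_id, actor_id, session_id):
    """Return (status, turns); never raise, but always log which path was taken.

    `fetch_turns` is any callable taking (memory_id, actor_id, session_id).
    """
    if not memory_id:
        status, turns = MemoryStatus.DISABLED, []
    elif not actor_id or not session_id:
        status, turns = MemoryStatus.IDENTITY_INVALID, []
    else:
        try:
            turns = fetch_turns(memory_id, actor_id, session_id)
            status = MemoryStatus.LOADED
        except Exception:
            logger.exception("memory load raised")
            status, turns = MemoryStatus.LOAD_FAILED, []
    # One structured line per request makes a stateful-to-stateless shift visible.
    logger.info("memory_status=%s turns=%d", status.value, len(turns))
    return status, turns
```

An alarm on the count of `memory_load_failed` or `memory_disabled` lines in production is then enough to catch the silent shift described above.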

Lesson 11: Retention, privacy, and compliance show up earlier than expected

Short-term memory sounds lightweight, but it is still stored interaction data.

That means retention policy is not just a platform setting. It is part of the product design. While building this, I became much more aware that memory decisions quickly intersect with:

data handling policy,
privacy expectations,
deletion and retention requirements,
security review,
production observability.

The technical implementation can be elegant, but if these operational questions are not addressed early, the design will be incomplete.

Lesson 12: AgentCore became more useful to me when I treated it as a runtime system, not just a hosting target

This may be the broadest lesson of all.

At first, I thought of AgentCore Runtime mainly as the place where the agent container would run. But while building with memory, I started appreciating it more as a runtime environment with clear operational boundaries:

the runtime executes the agent,
the framework manages reasoning and tools,
the memory plane manages event continuity,
the deployment workflow binds the right resources together.

That view helped me move beyond “deploy a model wrapper in a container” toward “operate an agent system with state, identity, and lifecycle.”

For me, that was the real shift.

The technical pattern I would reuse

If I were building the same class of agent again, I would reuse the same high-level pattern:

Create a dedicated short-term memory resource.
Resolve the correct memory resource during deployment.
Pass memory identity into the runtime explicitly.
Build the agent per request with user and session state.
Load recent turns during agent initialization.
Persist new messages when they are added.
Keep replay windows bounded.
Treat actor_id and session_id as core correctness boundaries.

I would also keep the same mental model:

short-term memory is for continuity,
long-term memory is for durable recall,
hooks are the right place for memory orchestration,
deployment is part of memory architecture,
observability should make degraded memory behavior visible.

Closing thought

The biggest lesson I learned while building with Amazon Bedrock AgentCore Runtime is that memory is not something you sprinkle onto an agent once the rest of the system works. Memory changes the shape of the system.

It affects:

request lifecycle,
identity boundaries,
prompt construction,
deployment,
observability,
privacy,
and tool coherence across turns.

Once I accepted that, the architecture became much more disciplined. The agent became easier to reason about, easier to operate, and much more capable in real multi-turn interactions.

That is the lesson I would carry into any future AgentCore build: if the experience is meant to feel conversational, memory has to be designed as a first-class runtime concern from the beginning.

API Gateway as Websocket

Amit Kayal — Tue, 21 Jan 2025 07:49:42 +0000

API Gateway as websocket

API Gateway as WS Components

WebSocket provides bidirectional, session-aware communication between caller and receiver, and is a crucial component of real-time applications.

Set up API Gateway for WebSocket
- Create a WebSocket API in the Amazon API Gateway console or through IaC.
- Define the WebSocket API route selection expression, e.g., $request.body.action. API Gateway evaluates this expression against each incoming message to decide which route handles it.
- Define the following WebSocket routes:
  - $connect: Triggered when a client establishes a connection.
  - $disconnect: Triggered when a client disconnects.
  - Custom routes, e.g., sendMessage, to handle specific actions.
Create an Integration with AWS Lambda
- For each route ($connect, $disconnect, custom routes), integrate a Lambda function to handle the respective logic.
- Use the Lambda function's handler to process:
  - $connect: Store the connection in DynamoDB.
  - $disconnect: Remove the connection from DynamoDB.
  - Custom routes: Process the message and forward it to SQS.
DynamoDB for Connection Management
- Create a DynamoDB table to store:
  - Connection ID (Primary Key).
  - Session ID or other metadata for grouping connections.
- This table allows tracking active WebSocket connections for broadcasting messages.
Configure SQS for Message Queue
- Use an SQS FIFO queue for guaranteed order and deduplication.
- Messages processed in Lambda (custom routes) are sent to SQS for downstream services.
IAM Roles and Permissions
- Assign an IAM role to the API Gateway to invoke the integrated Lambda functions.
- Grant Lambda permissions to read/write from DynamoDB and send messages to SQS.
Client Connection and Messaging
- Use WebSocket-compatible libraries (e.g., ws in Node.js or the WebSocket API in browsers) to:
  - Establish a WebSocket connection to the API Gateway endpoint.
  - Send and receive messages using the WebSocket protocol.
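The route handling described in the setup steps can be sketched as a single dispatch function. This is an illustration only: `handle_route` is a hypothetical name, and the dict and list passed in stand in for the DynamoDB table and SQS queue so the logic can run without AWS. The `requestContext.routeKey` and `requestContext.connectionId` fields are the ones API Gateway places on WebSocket events.

```python
import json


def handle_route(event, connections, queue):
    """Dispatch one API Gateway WebSocket event by route key.

    `connections` is a dict-like store and `queue` a list-like sink; in a real
    deployment they would be a DynamoDB table and an SQS queue.
    """
    ctx = event["requestContext"]
    route, connection_id = ctx["routeKey"], ctx["connectionId"]

    if route == "$connect":
        connections[connection_id] = {"connectedAt": ctx.get("connectedAt")}
    elif route == "$disconnect":
        connections.pop(connection_id, None)
    elif route == "sendMessage":
        body = json.loads(event.get("body") or "{}")
        queue.append({"connectionId": connection_id, "message": body.get("message")})
    else:
        # $default: the route selection expression matched no custom route.
        return {"statusCode": 400, "body": f"unknown route: {route}"}
    return {"statusCode": 200}
```

In practice each route usually gets its own Lambda function; a single dispatcher is used here only to keep the three behaviors side by side.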

Architecture of Websocket mechanism

WebSocket Client:
- Initiates WebSocket connection and communicates via send() and onmessage().
API Gateway (WebSocket API):
- Manages WebSocket connections and invokes Lambda functions for defined routes.
Route Integration (Lambda Functions):
Every route should have an integration. There are three types: Mock, HTTP, and Lambda.
- $connect: Adds connection metadata to DynamoDB.
- $disconnect: Removes connection metadata from DynamoDB.
- $default route: Selected when the route selection expression cannot be matched against the message.
- Custom routes: Process messages, invoke the integration selected by the message content, and forward them to SQS.
DynamoDB:
- Maintains active connection records, including connectionId and associated metadata.
SQS FIFO Queue:
- Queues messages for downstream processing, ensuring delivery order and deduplication.
Downstream Services:
- Processes messages from SQS and performs actions like notifications, data updates, or storage.
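Since the connection table exists so that messages can be pushed back to clients, a broadcast step might look like the sketch below. `broadcast` is a hypothetical helper; it assumes a client shaped like boto3's `apigatewaymanagementapi` client, whose `post_to_connection(ConnectionId=..., Data=...)` call raises `GoneException` for connections that have already closed.

```python
def broadcast(api_client, connection_ids, payload: bytes):
    """Send `payload` to every connection; return the IDs that are gone.

    `api_client` is expected to look like boto3's `apigatewaymanagementapi`
    client, i.e. expose post_to_connection(ConnectionId=..., Data=...).
    """
    stale = []
    for connection_id in connection_ids:
        try:
            api_client.post_to_connection(ConnectionId=connection_id, Data=payload)
        except Exception as exc:
            # boto3 raises GoneException (HTTP 410) for closed connections;
            # checking the class name keeps this sketch free of a boto3 import.
            if type(exc).__name__ == "GoneException":
                stale.append(connection_id)
            else:
                raise
    return stale
```

The returned IDs are the ones worth deleting from DynamoDB, which covers clients that vanished without triggering $disconnect.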

Security

Authentication and Authorization

Custom Authorizer (Lambda Authorizer)
It can only be used for the $connect route.
- Create a Lambda Authorizer to validate custom tokens or headers sent during connection attempts.
- Example:
  - Validate a JWT token from an identity provider (e.g., Cognito, Auth0).
  - Check the token against allowed users or roles.
Amazon Cognito:
- Use Amazon Cognito for user authentication.
- Validate the Cognito-issued token in a Lambda authorizer on the $connect route; WebSocket APIs do not provide a built-in Cognito authorizer.
- Best suited for applications with user pools.
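The authorizer on $connect returns an IAM policy that allows or denies `execute-api:Invoke` on the connecting route. The sketch below shows that response shape; `authorizer_handler` and the injected `validate_token` callable are hypothetical, and reading the token from a `token` query string parameter is an assumption (real JWT verification against your identity provider would replace the callable).

```python
def authorizer_handler(event, context, validate_token=None):
    """Lambda authorizer for the $connect route (REQUEST type).

    `validate_token` is a hypothetical callable returning a user ID for a valid
    token or None otherwise; real JWT verification would go there.
    """
    validate_token = validate_token or (lambda token: None)
    # Browsers cannot set custom headers on WebSocket handshakes, so the token
    # is read from the query string here; that choice is an assumption.
    params = event.get("queryStringParameters") or {}
    user_id = validate_token(params.get("token"))

    return {
        "principalId": user_id or "anonymous",
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": "Allow" if user_id else "Deny",
                    "Resource": event["methodArn"],
                }
            ],
        },
        # Context values are passed on to the integration's request context.
        "context": {"userId": user_id or ""},
    }
```

With no validator supplied, the default rejects every token, so a misconfigured deployment fails closed.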

Secure WebSocket Connections

Always use the secure WebSocket protocol (wss://). API Gateway enforces HTTPS/TLS, ensuring encrypted communication.
Associate a custom domain with the API Gateway WebSocket endpoint. We should use AWS Certificate Manager (ACM) to manage SSL/TLS certificates.

IP Whitelisting and Blacklisting

We should attach AWS WAF to API Gateway and block or allow requests based on IP addresses or CIDR ranges. We should also use rate limiting to protect against DDoS attacks.

API Gateway Throttling

We can set rate and burst limits on API Gateway routes to limit the number of connections per client.
We can create API keys, associate them with a usage plan, and then limit the number of allowed requests per API key.

Environment-based Access Control:

We should always use distinct stages (e.g., dev, prod) and restrict connections to the production API through IP rules.

Tools to test

The following tools can be used to test a WebSocket API:

Piesocket
Postman

S3 table & S3 Metadata table

Amit Kayal — Mon, 09 Dec 2024 18:26:23 +0000

Open table format and its architecture

Open table formats, such as Apache Iceberg, Apache Hudi, and Delta Lake, have gained popularity in data analytics mainly because of:

ACID Transactions: Open table formats (e.g., Apache Iceberg, Delta Lake) ensure reliable and consistent data updates, even with concurrent access.
Schema Evolution: They allow seamless updates to schemas without disrupting existing pipelines, simplifying data management. Metadata tracks the changes to the dataset: the files held in the data layer are captured by the metadata files held in the metadata layer, and as the data files change, the metadata files track these changes.
Optimized Queries: Partitioning and indexing enable faster queries by scanning only relevant data, improving performance and cost-efficiency.
Time Travel: Users can access historical versions of data for debugging, compliance, or analytics.
Interoperability: These formats integrate seamlessly with big data tools like Spark, Flink, and Presto, making them versatile and widely adopted.

Open file format

S3 table

Key Features

Amazon S3 Tables is optimized for analytics workloads. It is designed to continuously enhance query performance and reduce storage costs for tabular data. This solution looks very promising if you are working with a lakehouse architecture.
A new bucket type, the table bucket, has been introduced to support this; it organizes tables as sub-resources. Like any other AWS resource, it has an ARN and can take a resource policy, and as a unique feature it has a dedicated endpoint.

S3 Tables are intended explicitly for storing data in a tabular format, such as daily purchase transactions, streaming sensor data, or ad impressions. This data is organized into columns and rows like a database table.
Table buckets support storing tables in the Apache Iceberg format. You can query these tables using standard SQL in query engines that support Iceberg.
Reads and writes are allowed on data files and metadata files. Direct deletes and updates of those files are not allowed, to preserve data integrity.
Compatible query engines include Amazon Athena, Amazon Redshift, and Apache Spark.
S3 Table automatically performs maintenance tasks like compaction and snapshot management to optimize your tables for querying, including removing unreferenced files.
S3 Tables offers access management at both the table and the bucket level.
Fully managed Apache Iceberg tables in S3.
It supports automatic compaction of the underlying files to improve query performance, and lets us tune it further for better latency.

S3 Table buckets namespace

A namespace logically groups related S3 tables together, giving us greater control at the namespace level. It helps with the following:

Logical segmentation of data and multi-tenancy
- Supports multi-tenancy through separate namespaces, and supports compliance with data isolation requirements in regulated industries.
- Separates tables by application, project, etc.
prevent naming conflicts
- Each namespace acts like a "container," allowing tables with the same name in different namespaces without conflicts.
Better Access Control
- Policies can grant or restrict access to specific namespaces, ensuring data security and compliance. It also reduces the risk of unauthorized access to unrelated tables in the same bucket.
Easy data management
- Makes it easier to query, update, or delete related tables in bulk.
- Simplifies metadata management for tables grouped under a namespace.
Advanced workflows based on namespace
- It helps to simplify automation for data pipelines or real-time analytics applications.

S3 table operation & management

Table Operation
They are quite similar to CRUD operation.

list tables
create tables
Get table metadata location
Update table metadata location
Delete Table

Table Management

Put Table Policy
Put Table Bucket Policy
Put Table Maintenance Config
Put Table Bucket Maintenance Config

Policies related to S3 table operation

Allow access to create and use table buckets

Here, Action lists the specific actions the policy allows.

These actions are S3 Tables-specific:

s3tables:CreateTableBucket: Grants permission to create a table bucket in S3 Tables.
s3tables:PutTableBucketPolicy: Allows setting or updating the bucket policy for a table bucket.
s3tables:GetTableBucketPolicy: Allows retrieving the bucket policy associated with a table bucket.
s3tables:ListTableBuckets: Allows listing all table buckets within the specified scope.
s3tables:GetTableBucket: Grants permission to access the metadata of a specific table bucket.
Resource Defines the scope of the resources these actions can apply to.
"arn:aws:s3tables:region:account_id:bucket/*": Specifies all table buckets in the account (account_id) and region (region).
The * after bucket/ indicates that permissions apply to all buckets under this account and region.

{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowBucketActionsForUser",
        "Effect": "Allow",
        "Action": [
            "s3tables:CreateTableBucket",
            "s3tables:PutTableBucketPolicy",
            "s3tables:GetTableBucketPolicy",
            "s3tables:ListTableBuckets",
            "s3tables:GetTableBucket"
        ],
        "Resource": "arn:aws:s3tables:region:account_id:bucket/*"
    }]
}

Allow access to create and use tables in a table bucket

Here, Action lists the specific S3 Tables actions allowed by the policy. Note that the first policy focused on creating and managing table buckets and their associated metadata; it did not include table-level operations such as creating tables, reading or writing data, or updating metadata. These are the operations where namespaces become relevant.

s3tables:CreateTable: Allows creating new tables in the specified table bucket.
s3tables:PutTableData: Grants permission to write data to tables within the table bucket.
s3tables:GetTableData: Allows reading data from tables in the bucket.
s3tables:GetTableMetadataLocation: Allows retrieving metadata location information for a table.
s3tables:UpdateTableMetadataLocation: Grants permission to update the metadata location of a table.
s3tables:GetNamespace: Allows retrieving namespace information associated with the table bucket.
s3tables:CreateNamespace: Grants permission to create namespaces for organizing table data.

Resource section specifies

Grants permissions on the bucket named amzn-s3-demo-table-bucket
Grants permissions on all tables within the amzn-s3-demo-table-bucket

{
     "Version": "2012-10-17",
     "Statement": [ 
         {
             "Sid": "AllowBucketActions",
             "Effect": "Allow",
             "Action": [
                 "s3tables:CreateTable",
                 "s3tables:PutTableData",
                 "s3tables:GetTableData",
                 "s3tables:GetTableMetadataLocation",
                 "s3tables:UpdateTableMetadataLocation",
                 "s3tables:GetNamespace",
                 "s3tables:CreateNamespace"
             ],

             "Resource": [
               "arn:aws:s3tables:region:account_id:bucket/amzn-s3-demo-table-bucket",
               "arn:aws:s3tables:region:account_id:bucket/amzn-s3-demo-table-bucket/table/*"
            ]
         }
     ]
 }

Table bucket policy to allow read access to a namespace

This policy allows reading S3 tables from a namespace. Here, Action lists the specific S3 Tables actions allowed by the policy.

s3tables:GetTableData: Allows reading data from tables in the bucket.
s3tables:GetTableMetadataLocation: Allows retrieving metadata location information for a table.

The Resource section covers all S3 tables under the bucket amzn-s3-demo-table-bucket1, and the s3tables:namespace condition then restricts access to tables in the hr namespace only.

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Principal": {
               "AWS": "arn:aws:iam::123456789012:user/Jane"
             },
             "Action": [
                  "s3tables:GetTableData",
                  "s3tables:GetTableMetadataLocation"
             ],
             "Resource": "arn:aws:s3tables:region:account_id:bucket/amzn-s3-demo-table-bucket1/table/*",
             "Condition": {
                  "StringLike": { "s3tables:namespace": "hr" }
             }
         }
     ]
 }

S3 table automatic maintenance

It provides automated maintenance through configurations that help simplify table management, optimize performance, and reduce operational overhead.

Table Lifecycle Management
- We can add S3 Tables configurations that include lifecycle policies to automatically handle data expiration, transitions, or archival.
- automatic snapshot expiration can be configured easily.
Data Compaction
- S3 Tables automatically compact small files (often produced by incremental writes) into larger, optimized files. It helps to have faster query and reduce storage cost.
Schema Evolution
- Automated checks ensure compatibility between new and existing data.
Metadata Optimization
- Indexing of metadata for faster querying and retrieval of table details.

All of these can be set up through policy-based configuration.

Policy for snapshot management

By configuring maxSnapshotAgeHours, we can specify the retention period for table snapshots. The following example ensures S3 Tables automatically retains only the snapshots from the last 30 days (720 hours).

minSnapshotsToKeep: Ensures that at least this many snapshots (here, one) are always retained, regardless of age.
maxSnapshotAgeHours: Specifies the maximum age (in hours) for snapshots to be retained.

aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:region:account_id:bucket/bucket_name \
    --namespace namespace_name \
    --name table_name \
    --type icebergSnapshotManagement \
    --value '{
        "status": "enabled",
        "settings": {
            "icebergSnapshotManagement": {
                "minSnapshotsToKeep": 1,
                "maxSnapshotAgeHours": 720
            }
        }
    }'
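The same snapshot retention setting can be built from Python, which also makes the days-to-hours arithmetic explicit. This is a sketch under assumptions: `snapshot_management_request` is a hypothetical helper, and the parameter names are the ones I understand the S3 Tables `PutTableMaintenanceConfiguration` API to use, so check them against your SDK version before relying on them.

```python
def snapshot_management_request(table_bucket_arn, namespace, table_name,
                                max_age_days=30, min_snapshots=1):
    """Build kwargs for the S3 Tables PutTableMaintenanceConfiguration call."""
    if max_age_days < 1 or min_snapshots < 1:
        raise ValueError("max_age_days and min_snapshots must be at least 1")
    return {
        "tableBucketARN": table_bucket_arn,
        "namespace": namespace,
        "name": table_name,
        "type": "icebergSnapshotManagement",
        "value": {
            "status": "enabled",
            "settings": {
                "icebergSnapshotManagement": {
                    "minSnapshotsToKeep": min_snapshots,
                    # The API takes hours, so 30 days becomes 720.
                    "maxSnapshotAgeHours": max_age_days * 24,
                }
            },
        },
    }

# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# boto3.client("s3tables").put_table_maintenance_configuration(
#     **snapshot_management_request(arn, "hr", "employees"))
```

Keeping the request builder separate from the API call lets the retention values be reviewed and unit-tested without touching an AWS account.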

S3 Table Integration with AWS Analytics

S3 Tables integrate seamlessly with AWS analytics services to enable querying, processing and insight generation.

Amazon Athena - Run serverless SQL queries on S3 Tables

Use AWS Glue to create a Data Catalog for S3 Tables.
Query data directly using SQL in Athena.
Leverage table formats like Apache Iceberg or Parquet for optimized performance.

AWS Glue - Automate ETL processes for S3 Tables

Use Glue Crawlers to discover table metadata.
Create ETL jobs to transform and load data into S3 Tables or other destinations.

S3 Metadata table

It includes system metadata, including object tags and user-defined metadata.
It is stored in an S3 table.
It is generated in near real time as data is created, so it can be queried within minutes.

Use case for S3 metadata table

Real-Time Analytics
- efficient query execution on metadata to identify relevant data partitions.
Machine Learning Pipelines
- metadata tables to filter, select, and partition data for model training.
Governance and Compliance
- Track data retention and enforce lifecycle policies via metadata.
Multi-Tenant Data Applications
- Use namespaces within metadata tables to logically isolate tenant data.
Data Cataloging and Discovery
- Use metadata queries to identify datasets matching specific criteria.

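Here is a sample Python function that queries the metadata table through Athena.

import time

import boto3

# Placeholder values (assumptions): point these at your own database,
# metadata table, and Athena results location.
DATABASE = "<database>"
TABLE = "<table>"
S3_OUTPUT = "s3://<bucket>/<prefix>/"

athena_client = boto3.client("athena")


def query_metadata_table(criteria):
    """Run a filtered SELECT against the metadata table and return the rows."""
    # NOTE: criteria is interpolated into the SQL text, so pass trusted input only.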
    query = f"""
        SELECT *
        FROM {DATABASE}.{TABLE}
        WHERE {criteria}
    """

    print(f"Running query: {query}")

    # Start Athena query
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': DATABASE},
        ResultConfiguration={'OutputLocation': S3_OUTPUT}
    )

    query_execution_id = response['QueryExecutionId']

    # Wait for query completion
    print("Waiting for query to complete...")
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        state = status['QueryExecution']['Status']['State']
        if state in ['SUCCEEDED', 'FAILED', 'CANCELLED']:
            break
        time.sleep(2)

    if state != 'SUCCEEDED':
        raise Exception(f"Query failed with state: {state}")

    # Retrieve results
    results = athena_client.get_query_results(QueryExecutionId=query_execution_id)
    datasets = []
    for row in results['ResultSet']['Rows'][1:]:  # Skip the header row
        # NULL cells come back without a VarCharValue key, so use .get()
        datasets.append([col.get('VarCharValue') for col in row['Data']])

    print(f"Query returned {len(datasets)} datasets matching the criteria.")
    return datasets

Brief Notes on AWS CodeDeploy

Amit Kayal — Thu, 21 Mar 2024 19:04:04 +0000

AWS CodeDeploy is a service that automates code deployments to any instance, including Amazon EC2 instances and instances running on-premises.

Supported Platforms/Deployment Types:

EC2/On-Premises: In-Place or Blue/Green Deployments
- Describes instances of physical servers that can be Amazon EC2 cloud instances, on-premises servers, or both. Applications created using the EC2/On-Premises compute platform can be composed of executable files, configuration files, images, and more.
- Deployments that use the EC2/On-Premises compute platform manage the way in which traffic is directed to instances by using an in-place or blue/green deployment type.
AWS Lambda: Canary, Linear, All-At-Once Deployments
- Applications created using the AWS Lambda compute platform can manage the way in which traffic is directed to the updated Lambda function versions during a deployment by choosing a canary, linear, or all-at-once configuration.
Amazon ECS: Blue/Green Deployment
- Used to deploy an Amazon ECS containerized application as a task set.
- CodeDeploy performs a blue/green deployment by installing an updated version of the containerized application as a new replacement task set. CodeDeploy reroutes production traffic from the original application, or task set, to the replacement task set. The original task set is terminated after a successful deployment.

Deployment approach for EC2

Deploys a revision to a set of instances.
Deploys a new revision that consists of an application and AppSpec file. The AppSpec specifies how to deploy the application to the instances in a deployment group.

Deployment approach for Lambda

Deploys a new version of a serverless Lambda function on a high-availability compute infrastructure.
Shifts production traffic from one version of a Lambda function to a new version of the same function. The AppSpec file specifies which Lambda function version to deploy.

Deployment approach for ECS

Deploys an updated version of an Amazon ECS containerized application as a new, replacement task set. CodeDeploy reroutes production traffic from the task set with the original version to the new replacement task set with the updated version. When the deployment completes, the original task set is terminated.

App Spec File

The application specification file (AppSpec file) is a YAML-formatted or JSON-formatted file used by CodeDeploy to manage a deployment. Note: the name of the AppSpec file for an EC2/On-Premises deployment must be appspec.yml. The name of the AppSpec file for an Amazon ECS or AWS Lambda deployment must be appspec.yml.

For ECS

The container and port in replacement task set where your Application Load Balancer or Network Load Balancer reroutes traffic during a deployment. This is specified with the LoadBalancerInfo instruction in the AppSpec file.
Amazon ECS task definition file. This is specified with its ARN in the TaskDefinition instruction in the AppSpec file.

For Lambda

Lambda function version to deploy.
Lambda functions to use as validation tests.

For EC2

Which lifecycle event hooks to run in response to deployment lifecycle events.

Bedrock Agent & Tools - Tracing Best practises

Amit Kayal — Wed, 20 Mar 2024 17:58:52 +0000

I understand that most Bedrock Agents users will have a use case where multiple Lambda functions are implemented behind a Bedrock agent to perform different tasks, and are looking for guidance on debugging the API calls and responses from the agent and the Lambda functions.

Here are some of the approaches that we have been using and found quite effective to track and trace agents and usage of their tools

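Enable Tracing for the Agent: When invoking the agent, set the enableTrace parameter of the InvokeAgent request to true. This enables detailed tracing of the agent's execution, including the tools (Lambda functions) invoked and their responses. The trace is returned as trace events in the agent's response stream, alongside the answer chunks. [1] Example (Python, with agent_id, agent_alias_id, session_id, and query supplied by your application):

import boto3

client = boto3.client("bedrock-agent-runtime")
response = client.invoke_agent(agentId=agent_id, agentAliasId=agent_alias_id,
                               sessionId=session_id, inputText=query, enableTrace=True)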

Log Within Lambda Functions: Within each of your Lambda functions (tools), add logging statements to capture relevant information and events. You can use AWS Lambda's built-in logging capabilities or integrate with a centralized logging service like Amazon CloudWatch Logs. [2] Example (Python):

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)  # Lambda's default level would drop INFO records


def lambda_handler(event, context):
    logger.info(f"Received event: {event}")
    # Your Lambda function's logic here
    result = {"status": "ok"}  # placeholder result
    logger.info(f"Returning result: {result}")
    return result

Correlate Logs Using Request IDs or Tracing IDs: To correlate logs across multiple Lambda functions and the agent, you can use request IDs or tracing IDs. Pass a unique ID as part of the event or context to your Lambda functions and include it in your log statements. This will allow you to trace the flow of events across different components of your system.

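    import logging
    import uuid

    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)

    def lambda_handler(event, context):
        # Reuse the caller's ID when present; otherwise generate a new one.
        request_id = event.get("request_id", str(uuid.uuid4()))
        # Put the ID in the message itself so it shows up with any log format.
        logger.info("[%s] Received event: %s", request_id, event)
        # Your Lambda function's logic here
        result = {"request_id": request_id}  # placeholder result
        logger.info("[%s] Returning result: %s", request_id, result)
        return result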

Use AWS X-Ray for Distributed Tracing: AWS X-Ray is a service that can help you analyze and debug distributed applications, including Lambda functions. By integrating X-Ray with your Bedrock application, you can trace requests as they travel through your Lambda functions and gain insights into their performance and potential issues. [3]

- Enable X-Ray tracing for your Lambda functions by adding the necessary configuration.
- Instrument your Lambda functions with X-Ray tracing code to capture relevant information and events.
- Use the X-Ray console or integrate with other monitoring tools to analyze the traces and identify potential bottlenecks or issues.

Implement Advanced Prompts: With advanced prompts, you can improve your agent's accuracy by modifying the agent's prompt templates to provide detailed configurations. You can also provide hand-curated examples for few-shot prompting, in which you improve model performance by providing labeled examples for a specific task. [4]

By combining the built-in tracing mechanism, custom logging within your Lambda functions, and distributed tracing with AWS X-Ray, you can gain better visibility into the API calls, events, and interactions happening within your Bedrock agent and its associated tools. This can help you debug issues more effectively and trace errors back to their source across multiple Lambda functions.
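For the X-Ray step above, it can help to write the active trace ID into each log line so that CloudWatch log entries and X-Ray traces can be matched up. The following is a minimal sketch, assuming the Lambda runtime exposes the trace header in the `_X_AMZN_TRACE_ID` environment variable in the `Root=...;Parent=...;Sampled=...` form. The helper name `xray_trace_id` is hypothetical.

```python
import os


def xray_trace_id(header=None):
    """Return the Root trace ID from an X-Ray trace header, or None if absent."""
    if header is None:
        # Assumption: inside Lambda the header is available in this variable.
        header = os.environ.get("_X_AMZN_TRACE_ID", "")
    for field in header.split(";"):
        key, _, value = field.partition("=")
        if key.strip() == "Root" and value:
            return value.strip()
    return None


sample = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
print(xray_trace_id(sample))  # 1-5759e988-bd862e3fe1be46a994272793
```

Inside a tool function, the returned value can be logged next to the request ID, which gives one more key for joining agent traces, function logs, and X-Ray segments.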
Reference

AWS DEV OPS Professional Exam short notes

Amit Kayal — Sun, 17 Mar 2024 05:55:58 +0000

Last few weeks I have been preparing for this exam and have summarized below key notes for further quick reference.

Key Notes

You can use CloudWatch Logs to monitor applications and systems using log data. For example, CloudWatch Logs can track the number of errors that occur in your application logs and send you a notification whenever the rate of errors exceeds a threshold you specify. CloudWatch Logs uses your log data for monitoring, so no code changes are required. For more information on CloudWatch Logs, please refer to the below link: http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html The correct answer is: Install the CloudWatch Logs Agent on your AMI, and configure the CloudWatch Logs Agent to stream your logs.
You can add another layer of protection by enabling MFA Delete on a versioned bucket. Once you do so, you must provide your AWS account’s access keys and a valid code from the account’s MFA device in order to permanently delete an object version or suspend or reactivate versioning on the bucket. For more information on MFA, please refer to the below link: https://aws.amazon.com/blogs/security/securing-access-to-aws-using-mfa-part-3/ IAM roles are designed so that your applications can securely make API requests from your instances, without requiring you to manage the security credentials that the applications use. Instead of creating and distributing your AWS credentials, you can delegate permission to make API requests using IAM roles. For more information on roles for EC2, please refer to the below link: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
As your infrastructure grows, common patterns can emerge in which you declare the same components in each of your templates. You can separate out these common components and create dedicated templates for them. That way, you can mix and match different templates but use nested stacks to create a single, unified stack. Nested stacks are stacks that create other stacks. To create nested stacks, use the AWS::CloudFormation::Stack resource in your template to reference other templates. For more information on best practices for CloudFormation, please refer to the below link: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/best-practices.html The correct answer is: Separate the AWS CloudFormation template into a nested structure that has individual templates for the resources that are to be governed by different departments, and use the outputs from the networking and security stacks for the application template that you control.
You can use Amazon CloudWatch Logs to monitor, store, and access your log files from Amazon Elastic Compute Cloud (Amazon EC2) instances, AWS CloudTrail, and other sources. You can then retrieve the associated log data from CloudWatch Logs. For more information on CloudWatch Logs, please refer to the below link: http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html You can then use Kinesis to process those logs. For more information on Amazon Kinesis, please refer to the below link: http://docs.aws.amazon.com/streams/latest/dev/introduction.html The correct answers are: (1) Using AWS CloudFormation, create a CloudWatch Logs LogGroup and send the operating system and application logs of interest using the CloudWatch Logs Agent. (2) Using configuration management, set up remote logging to send events to Amazon Kinesis and insert these into Amazon CloudSearch or Amazon Redshift, depending on available analytic tools.
IAM roles are designed so that your applications can securely make API requests from your instances, without requiring you to manage the security credentials that the applications use. Instead of creating and distributing your AWS credentials, you can delegate permission to make API requests using IAM roles. For more information on IAM roles, please refer to the below link: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
The AWS Security Token Service (STS) is a web service that enables you to request temporary, limited-privilege credentials for AWS Identity and Access Management (IAM) users or for users that you authenticate (federated users). The token can then be used to grant access to the objects in S3. You can then provide access to the objects based on key values generated from the user ID.
As your infrastructure grows, common patterns can emerge in which you declare the same components in each of your templates. You can separate out these common components and create dedicated templates for them. That way, you can mix and match different templates but use nested stacks to create a single, unified stack. Nested stacks are stacks that create other stacks. To create nested stacks, use the AWS::CloudFormation::Stack resource in your template to reference other templates. For more information on CloudFormation best practices, please refer to the below link: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/best-practices.html The correct answer is: Create separate templates based on functionality, create nested stacks with CloudFormation.
The default Auto Scaling termination policy is designed to help ensure that your network architecture spans Availability Zones evenly. When using the default termination policy, Auto Scaling selects an instance to terminate as follows: Auto Scaling determines whether there are instances in multiple Availability Zones. If so, it selects the Availability Zone with the most instances and at least one instance that is not protected from scale in. If there is more than one Availability Zone with this number of instances, Auto Scaling selects the Availability Zone with the instances that use the oldest launch configuration. For more information on Auto Scaling instance termination, please refer to the below link: http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-termination.html The correct answer is: Auto Scaling will select the AZ with 4 EC2 instances and terminate an instance.
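The first step of that selection can be sketched in a few lines. This is an illustrative model only, not the service's implementation: the function name `candidate_azs` and the `(az, protected)` tuple layout are assumptions, and the launch-configuration tie-breaker is deliberately left out, so tied zones are simply returned together.

```python
from collections import Counter


def candidate_azs(instances):
    """instances: list of (availability_zone, protected_from_scale_in) tuples.

    Returns the zone or zones with the most instances among those that have
    at least one instance not protected from scale in.
    """
    per_az = Counter(az for az, _ in instances)
    eligible = {az for az, protected in instances if not protected}
    if not eligible:
        return []
    most = max(per_az[az] for az in eligible)
    return sorted(az for az in eligible if per_az[az] == most)


# Four instances in one zone and three in another, none protected.
group = [("us-east-1a", False)] * 4 + [("us-east-1b", False)] * 3
print(candidate_azs(group))  # ['us-east-1a']
```

With four instances in one zone and three in the other, the zone holding four is chosen, which matches the answer quoted in the note.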
Amazon RDS Read Replicas provide enhanced performance and durability for database (DB) instances. This replication feature makes it easy to elastically scale out beyond the capacity constraints of a single DB Instance for read-heavy database workloads. You can create one or more replicas of a given source DB Instance and serve high-volume application read traffic from multiple copies of your data, thereby increasing aggregate read throughput. Sharding is a common concept to split data across multiple tables in a database. Shard your data set among multiple Amazon RDS DB instances. Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud. The service improves the performance of web applications by allowing you to retrieve information from fast, managed, in-memory data stores, instead of relying entirely on slower disk-based databases.
Continuous Integration (CI) is a development practice that requires developers to integrate code into a shared repository several times a day. Each check-in is then verified by an automated build, allowing teams to detect problems early.
Elastic Beanstalk simplifies the processing of queued background jobs by managing the Amazon SQS queue and running a daemon process on each instance that reads from the queue for you. When the daemon pulls an item from the queue, it sends an HTTP POST request locally to http://localhost/ with the contents of the queue message in the body. All that your application needs to do is perform the long-running task in response to the POST. For more information on managing Elastic Beanstalk worker environments, please visit the below URL: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
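The application side of that contract is just an HTTP endpoint that accepts the POST. Below is a minimal sketch using only the Python standard library; `process_message`, `WorkerHandler`, `serve`, and port 8080 are illustrative assumptions, and the job format (a JSON body) is made up for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def process_message(body: bytes) -> dict:
    """Hypothetical long-running task; here it only decodes and echoes the job."""
    job = json.loads(body or b"{}")
    return {"processed": job}


class WorkerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            process_message(body)
            # A 200 response tells the daemon the message was handled,
            # so it can delete the message from the queue.
            self.send_response(200)
        except Exception:
            # A non-200 response leaves the message to be retried later.
            self.send_response(500)
        self.end_headers()


def serve(port=8080):
    """Start the worker endpoint; the port is an assumption for local testing."""
    HTTPServer(("127.0.0.1", port), WorkerHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the work itself lives in `process_message`, which keeps the HTTP plumbing separate from the task logic and easy to test on its own.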
If you suspend AddToLoadBalancer, Auto Scaling launches the instances but does not add them to the load balancer or target group. If you resume the AddToLoadBalancer process, Auto Scaling resumes adding instances to the load balancer or target group when they are launched. However, Auto Scaling does not add the instances that were launched while this process was suspended. You must register those instances manually. For more information on the Suspension and Resumption process, please visit the below URL: http://docs.aws.amazon.com/autoscaling/latest/userguide/as-suspend-resume-processes.html
You can use the container_commands key of Elastic Beanstalk to execute commands that affect your application source code. Container commands run after the application and web server have been set up and the application version archive has been extracted, but before the application version is deployed. Non-container commands and other customization operations are performed prior to the application source code being extracted. You can use leader_only to only run the command on a single instance, or configure a test to only run the command when a test command evaluates to true. Leader-only container commands are only executed during environment creation and deployments, while other commands and server customization operations are performed every time an instance is provisioned or updated. Leader-only container commands are not executed due to launch configuration changes, such as a change in the AMI ID or instance type. For more information on customizing containers, please visit the below URL: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html The correct answer is: Use a “Container command” within an Elastic Beanstalk configuration file to execute the script, ensuring that the “leader only” flag is set to true.
A Dockerrun.aws.json file is an Elastic Beanstalk-specific JSON file that describes how to deploy a set of Docker containers as an Elastic Beanstalk application. You can use a Dockerrun.aws.json file for a multicontainer Docker environment. Dockerrun.aws.json describes the containers to deploy to each container instance in the environment as well as the data volumes to create on the host instance for the containers to mount. For more information, please visit the below URL: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/create_deploy_docker_v2config.html
Elastic Beanstalk supports the deployment of web applications from Docker containers. With Docker containers, you can define your own runtime environment. You can choose your own platform, programming language, and any application dependencies (such as package managers or tools), that aren’t supported by other platforms. Docker containers are self-contained and include all the configuration information and software your web application requires to run.
When you see Amazon Kinesis as an option, this becomes the ideal option to process data in real time. Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. Amazon Kinesis offers key capabilities to cost effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit the requirements of your application. With Amazon Kinesis, you can ingest real-time data such as application logs, website clickstreams, IoT telemetry data, and more into your databases, data lakes and data warehouses, or build your own real-time applications using this data. For more information on Amazon Kinesis, please visit the below URL: https://aws.amazon.com/kinesis
You can use CloudWatch Logs to monitor applications and systems using log data. CloudWatch Logs uses your log data for monitoring, so no code changes are required. For example, you can monitor application logs for specific literal terms (such as “NullReferenceException”) or count the number of occurrences of a literal term at a particular position in log data (such as “404” status codes in an Apache access log). When the term you are searching for is found, CloudWatch Logs reports the data to a CloudWatch metric that you specify. Log data is encrypted while in transit and while it is at rest. For more information on CloudWatch Logs, please refer to the below link: http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html Amazon CloudWatch uses Amazon SNS to send email. First, create and subscribe to an SNS topic. When you create a CloudWatch alarm, you can add this SNS topic to send an email notification when the alarm changes state. The correct answers are: (1) Install a CloudWatch Logs Agent on your servers to stream web application logs to CloudWatch. (2) Create a CloudWatch Logs group and define metric filters that capture 500 Internal Server Errors, then set a CloudWatch alarm on that metric. (3) Use Amazon Simple Notification Service to notify an on-call engineer when a CloudWatch alarm is triggered.
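The metric filter and alarm from that answer can be sketched as the keyword arguments one would pass to boto3. This is a hedged example: the function name, filter name, metric name, namespace, threshold, and period are all illustrative assumptions, and the pattern assumes space-delimited access-log lines whose sixth field is the HTTP status code.

```python
def build_monitoring_requests(log_group, topic_arn):
    """Build keyword arguments for put_metric_filter and put_metric_alarm."""
    metric_filter = {
        "logGroupName": log_group,
        "filterName": "http-500-errors",
        # Space-delimited pattern: match lines whose status field equals 500.
        "filterPattern": "[ip, id, user, timestamp, request, status_code=500, size]",
        "metricTransformations": [
            {
                "metricName": "Http500Count",
                "metricNamespace": "WebApp",
                "metricValue": "1",  # each matching line adds 1 to the metric
            }
        ],
    }
    alarm = {
        "AlarmName": "http-500-errors-high",
        "Namespace": "WebApp",
        "MetricName": "Http500Count",
        "Statistic": "Sum",
        "Period": 300,             # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 10,           # illustrative threshold
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic that notifies the on-call engineer
    }
    return metric_filter, alarm


metric_filter, alarm = build_monitoring_requests(
    "/webapp/access", "arn:aws:sns:us-east-1:123456789012:on-call"
)
```

Assuming boto3 is installed and credentials are configured, the two dictionaries would be applied with `boto3.client("logs").put_metric_filter(**metric_filter)` and `boto3.client("cloudwatch").put_metric_alarm(**alarm)`.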
When you provision an Amazon EC2 instance in an AWS CloudFormation stack, you might specify additional actions to configure the instance, such as install software packages or bootstrap applications. Normally, CloudFormation proceeds with stack creation after the instance has been successfully created. However, you can use a CreationPolicy so that CloudFormation proceeds with stack creation only after your configuration actions are done. That way you’ll know your applications are ready to go after stack creation succeeds.
Auto Scaling periodically performs health checks on the instances in your Auto Scaling group and identifies any instances that are unhealthy. You can configure Auto Scaling to determine the health status of an instance using Amazon EC2 status checks, Elastic Load Balancing health checks, or custom health checks. By default, Auto Scaling health checks use the results of the EC2 status checks to determine the health status of an instance. Auto Scaling marks an instance as unhealthy if it fails one or more of the status checks. For more information on monitoring in Auto Scaling, please visit the below URL: http://docs.aws.amazon.com/autoscaling/latest/userguide/as-monitoring-features.html
You need to have a custom health check that evaluates the application functionality. It's not enough to use the normal health checks. If the application functionality does not work and you don’t have custom health checks, the instances will still be deemed healthy. If you have custom health checks, you can send the information from your health checks to Auto Scaling so that Auto Scaling can use this information. For example, if you determine that an instance is not functioning as expected, you can set the health status of the instance to Unhealthy. The next time that Auto Scaling performs a health check on the instance, it will determine that the instance is unhealthy and then launch a replacement instance. For more information on Auto Scaling health checks, please refer to the below document link from AWS: http://docs.aws.amazon.com/autoscaling/latest/userguide/healthcheck.html
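Reporting a custom health result comes down to one API call. The sketch below assumes a boto3 Auto Scaling client and its `set_instance_health` operation; the wrapper name `report_health` and the `FakeClient` stand-in (used so the example runs without AWS access) are hypothetical.

```python
def report_health(autoscaling_client, instance_id, app_is_healthy):
    """Send the result of an application-level check to Auto Scaling."""
    status = "Healthy" if app_is_healthy else "Unhealthy"
    autoscaling_client.set_instance_health(
        InstanceId=instance_id,
        HealthStatus=status,
        # Respect the group's health check grace period for new instances.
        ShouldRespectGracePeriod=True,
    )
    return status


class FakeClient:
    """Stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def set_instance_health(self, **kwargs):
        self.calls.append(kwargs)


client = FakeClient()
print(report_health(client, "i-0123456789abcdef0", app_is_healthy=False))  # Unhealthy
```

In a real deployment the client would come from `boto3.client("autoscaling")`, and `app_is_healthy` would be the outcome of whatever functional check the application needs, such as a test transaction.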
A blue group carries the production load while a green group is staged and deployed with the new code. When it’s time to deploy, you simply attach the green group to the existing load balancer to introduce traffic to the new environment. For HTTP/HTTPS listeners, the load balancer favors the green Auto Scaling group because it uses a least outstanding requests routing algorithm. As you scale up the green Auto Scaling group, you can take blue Auto Scaling group instances out of service by either terminating them or putting them in Standby state. For more information on Blue Green Deployments, please refer to the below document link from AWS: https://d0.awsstatic.com/whitepapers/AWS_Blue_Green_Deployments.pdf
Ensure first that the CloudFormation template is updated with the new instance type. The AWS::AutoScaling::AutoScalingGroup resource supports an UpdatePolicy attribute. This is used to define how an Auto Scaling group resource is updated when an update to the CloudFormation stack occurs. A common approach to updating an Auto Scaling group is to perform a rolling update, which is done by specifying the AutoScalingRollingUpdate policy. This retains the same Auto Scaling group and replaces old instances with new ones, according to the parameters specified.
With web identity federation, you don’t need to create custom sign-in code or manage your own user identities. Instead, users of your app can sign in using a well-known identity provider (IdP) —such as Login with Amazon, Facebook, Google, or any other OpenID Connect (OIDC)-compatible IdP, receive an authentication token, and then exchange that token for temporary security credentials in AWS that map to an IAM role with permissions to use the resources in your AWS account. Using an IdP helps you keep your AWS account secure, because you don’t have to embed and distribute long-term security credentials with your application.
The optional Conditions section includes statements that define when a resource is created or when a property is defined. For example, you can compare whether a value is equal to another value. Based on the result of that condition, you can conditionally create resources. If you have multiple conditions, separate them with commas. You might use conditions when you want to reuse a template that can create resources in different contexts, such as a test environment versus a production environment. In your template, you can add an EnvironmentType input parameter, which accepts either prod or test as inputs. For the production environment, you might include Amazon EC2 instances with certain capabilities; however, for the test environment, you want to use reduced capabilities to save money. With conditions, you can define which resources are created and how they’re configured for each environment type. For more information on Cloudformation conditions please refer to the below link: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/conditions-section-structure.html
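The EnvironmentType scenario above can be written out as a small template. The sketch below builds it as a Python dictionary and prints it as JSON; the resource names, instance types, volume size, and the `ami-EXAMPLE` image ID are placeholders chosen for illustration, not values from the exam material.

```python
import json

template = {
    "Parameters": {
        "EnvironmentType": {
            "Type": "String",
            "AllowedValues": ["prod", "test"],
            "Default": "test",
        }
    },
    "Conditions": {
        # True only when the stack is created with EnvironmentType=prod.
        "IsProd": {"Fn::Equals": [{"Ref": "EnvironmentType"}, "prod"]}
    },
    "Resources": {
        "AppInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-EXAMPLE",  # placeholder image ID
                # A property chosen per environment.
                "InstanceType": {"Fn::If": ["IsProd", "m5.large", "t3.micro"]},
            },
        },
        "ProdVolume": {
            "Type": "AWS::EC2::Volume",
            "Condition": "IsProd",  # the resource exists only in production
            "Properties": {
                "Size": 100,
                "AvailabilityZone": {"Fn::GetAtt": ["AppInstance", "AvailabilityZone"]},
            },
        },
    },
}

print(json.dumps(template, indent=2))
```

The same condition is used in two ways here: `Fn::If` switches a property value, while the `Condition` key on `ProdVolume` controls whether the resource is created at all.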
Elastic Beanstalk already has the facility to manage various versions, and you don’t need to use S3 separately for this. AWS Elastic Beanstalk is the perfect solution for developers to maintain application versions. With AWS Elastic Beanstalk, you can quickly deploy and manage applications in the AWS Cloud without worrying about the infrastructure that runs those applications. AWS Elastic Beanstalk reduces management complexity without restricting choice or control. You simply upload your application, and AWS Elastic Beanstalk automatically handles the details of capacity provisioning, load balancing, scaling, and application health monitoring.
The first step in using Elastic Beanstalk is to create an application, which represents your web application in AWS. In Elastic Beanstalk an application serves as a container for the environments that run your web app, and versions of your web app’s source code, saved configurations, logs and other artifacts that you create while using Elastic Beanstalk. For more information on Applications, please refer to the below link: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/applications.html Deploying a new version of your application to an environment is typically a fairly quick process. The new source bundle is deployed to an instance and extracted, and then the web container or application server picks up the new version and restarts if necessary. During deployment, your application might still become unavailable to users for a few seconds. You can prevent this by configuring your environment to use rolling deployments to deploy the new version to instances in batches. For more information on deployment, please refer to the below link: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.deploy-existing-version.html
Weighted routing lets you associate multiple resources with a single domain name (example.com) or subdomain name (acme.example.com) and choose how much traffic is routed to each resource. This can be useful for a variety of purposes, including load balancing and testing new versions of software. For more information on the Routing policy please refer to the below link: http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy.html
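The traffic split follows directly from the weights: each record receives its weight divided by the sum of all weights in the group. A minimal sketch of that arithmetic, with the function name `traffic_shares` and the record names chosen for illustration (the all-zero case is not modeled here):

```python
def traffic_shares(weights):
    """Return each record's expected share of traffic from its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one record needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Send a quarter of the traffic to the new version while testing it.
print(traffic_shares({"current": 3, "new": 1}))  # {'current': 0.75, 'new': 0.25}
```

Shifting more traffic to the new version is then a matter of raising its weight relative to the others.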
Amazon Elasticsearch Service makes it easy to deploy, operate, and scale Elasticsearch for log analytics, full text search, application monitoring, and more. Amazon Elasticsearch Service is a fully managed service that delivers Elasticsearch’s easy-to-use APIs and real-time capabilities along with the availability, scalability, and security required by production workloads. The service offers built-in integrations with Kibana, Logstash, and AWS services including Amazon Kinesis Firehose, AWS Lambda, and Amazon CloudWatch so that you can go from raw data to actionable insights quickly.
You can use CloudWatch Logs to monitor applications and systems using log data. For example, CloudWatch Logs can track the number of errors that occur in your application logs and send you a notification whenever the rate of errors exceeds a threshold you specify. CloudWatch Logs uses your log data for monitoring, so no code changes are required. For example, you can monitor application logs for specific literal terms (such as “NullReferenceException”) or count the number of occurrences of a literal term at a particular position in log data (such as “404” status codes in an Apache access log). When the term you are searching for is found, CloudWatch Logs reports the data to a CloudWatch metric that you specify. For more information on CloudWatch Logs, please refer to the below link: http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html Amazon CloudWatch uses Amazon SNS to send email. First, create and subscribe to an SNS topic. When you create a CloudWatch alarm, you can add this SNS topic to send an email notification when the alarm changes state. For more information on CloudWatch and SNS, please refer to the below link: http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/US_SetupSNS.html
AWS OpsWorks is a configuration management service that uses Chef, an automation platform that treats server configurations as code. OpsWorks uses Chef to automate how servers are configured, deployed, and managed across your Amazon Elastic Compute Cloud (Amazon EC2) instances or on-premises compute environments. OpsWorks has two offerings, AWS OpsWorks for Chef Automate and AWS OpsWorks Stacks. For more information on OpsWorks, please refer to the below link: https://aws.amazon.com/opsworks/
You can use Kinesis Streams for rapid and continuous data intake and aggregation. The type of data used includes IT infrastructure log data, application logs, social media, market data feeds, and web clickstream data. Because the response time for the data intake and processing is in real time, the processing is typically lightweight. The following are typical scenarios for using Kinesis Streams:

- Accelerated log and data feed intake and processing: You can have producers push data directly into a stream. For example, push system and application logs and they’ll be available for processing in seconds. This prevents the log data from being lost if the front end or application server fails. Kinesis Streams provides accelerated data feed intake because you don’t batch the data on the servers before you submit it for intake.
- Real-time metrics and reporting: You can use data collected into Kinesis Streams for simple data analysis and reporting in real time. For example, your data-processing application can work on metrics and reporting for system and application logs as the data is streaming in, rather than wait to receive batches of data.

For more information on Amazon Kinesis, please refer to the below link: http://docs.aws.amazon.com/streams/latest/dev/introduction.html
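A producer that pushes log lines straight into a stream can be sketched as follows. The example only builds the arguments for a Kinesis `put_record` call so that it runs without AWS access; the function name `build_log_record`, the stream name, and the JSON payload layout are assumptions made for illustration.

```python
import json
import time


def build_log_record(stream_name, host, line):
    """Build keyword arguments for a Kinesis put_record call for one log line."""
    payload = {"host": host, "ts": time.time(), "line": line}
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),
        # Records sharing a partition key go to the same shard,
        # which keeps each host's lines in order.
        "PartitionKey": host,
    }


record = build_log_record("app-logs", "web-1", "GET /health 200")
print(record["PartitionKey"])  # web-1
```

Assuming boto3 is installed and credentials are configured, the record would be sent with `boto3.client("kinesis").put_record(**record)`, one call per log line and without batching on the server first.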
With Elastic Beanstalk, you can quickly deploy and manage applications in the AWS Cloud without worrying about the infrastructure that runs those applications. AWS Elastic Beanstalk reduces management complexity without restricting choice or control. You simply upload your application, and Elastic Beanstalk automatically handles the details of capacity provisioning, load balancing, scaling, and application health monitoring. For more information on Elastic Beanstalk, please refer to the below link: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html
You can use intrinsic functions, such as Fn::If, Fn::Equals, and Fn::Not, to conditionally create stack resources. These conditions are evaluated based on input parameters that you declare when you create or update a stack. After you define all your conditions, you can associate them with resources or resource properties in the Resources and Outputs sections of a template.
Amazon RDS Multi-AZ deployments provide enhanced availability and durability for Database (DB) Instances, making them a natural fit for production database workloads. When you provision a Multi-AZ DB Instance, Amazon RDS automatically creates a primary DB Instance and synchronously replicates the data to a standby instance in a different Availability Zone (AZ). Each AZ runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. In case of an infrastructure failure, Amazon RDS performs an automatic failover to the standby (or to a read replica in the case of Amazon Aurora), so that you can resume database operations as soon as the failover is complete.
You can use AWS CloudTrail to get a history of AWS API calls and related events for your account. This history includes calls made with the AWS Management Console, AWS Command Line Interface, AWS SDKs, and other AWS services. For more information on CloudTrail, please visit the below URL: http://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html Amazon CloudWatch Events delivers a near real-time stream of system events that describe changes in Amazon Web Services (AWS) resources. Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams. CloudWatch Events becomes aware of operational changes as they occur. CloudWatch Events responds to these operational changes and takes corrective action as necessary, by sending messages to respond to the environment, activating functions, making changes, and capturing state information.
By default, all AWS accounts are limited to 5 Elastic IP addresses per region, because public (IPv4) Internet addresses are a scarce public resource. We strongly encourage you to use an Elastic IP address primarily for the ability to remap the address to another instance in the case of instance failure, and to use DNS hostnames for all other inter-node communication
You can manage Amazon SQS messages with Amazon S3. This is especially useful for storing and consuming messages with a message size of up to 2 GB. To manage Amazon SQS messages with Amazon S3, use the Amazon SQS Extended Client Library for Java. Specifically, you use this library to:

- Specify whether messages are always stored in Amazon S3 or only when a message’s size exceeds 256 KB.
- Send a message that references a single message object stored in an Amazon S3 bucket.
- Get the corresponding message object from an Amazon S3 bucket.
- Delete the corresponding message object from an Amazon S3 bucket.

For more information on processing large messages for SQS, please visit the below URL: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-s3-messages.html
AWS CloudFormation provisions and configures resources by making calls to the AWS services that are described in your template. After all the resources have been created, AWS CloudFormation reports that your stack has been created. You can then start using the resources in your stack. If stack creation fails, AWS CloudFormation rolls back your changes by deleting the resources that it created. The below snapshot from CloudFormation shows what happens when there is an error in the stack creation. For more information on how CloudFormation works, please refer to the below link: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-whatis-howdoesitwork.html
Because Elastic Beanstalk performs an in-place update when you update your application versions, your application may become unavailable to users for a short period of time. It is possible to avoid this downtime by performing a blue/green deployment, where you deploy the new version to a separate environment, and then swap CNAMEs of the two environments to redirect traffic to the new version instantly. Blue/green deployments require that your environment runs independently of your production database, if your application uses one. If your environment has an Amazon RDS DB instance attached to it, the data will not transfer over to your second environment, and will be lost if you terminate the original environment. For more information on blue/green deployments with Elastic Beanstalk, please refer to the below link: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html
Amazon RDS Read Replicas provide enhanced performance and durability for database (DB) instances. This replication feature makes it easy to elastically scale out beyond the capacity constraints of a single DB Instance for read-heavy database workloads. You can create one or more replicas of a given source DB Instance and serve high-volume application read traffic from multiple copies of your data, thereby increasing aggregate read throughput. Read replicas can also be promoted when needed to become standalone DB instances.
Amazon Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources.
If you use SSL termination, your servers will always get non-secure connections and will never know whether users used a more secure channel or not. If you are using Elastic Beanstalk to configure the ELB, you can use the below article to ensure end-to-end encryption: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/configuring-https-endtoend.html
Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture, transform, and load streaming data into Amazon Kinesis Analytics, Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security. For more information on Kinesis firehose, please visit the below URL: https://aws.amazon.com/kinesis/firehose/
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.
Use Cloudfront distribution for distributing the heavy reads for your application. You can create a zone apex record to point to the Cloudfront distribution. You can control how long your objects stay in a CloudFront cache before CloudFront forwards another request to your origin. Reducing the duration allows you to serve dynamic content. Increasing the duration means your users get better performance because your objects are more likely to be served directly from the edge cache. A longer duration also reduces the load on your origin.
Amazon EBS encryption offers you a simple encryption solution for your EBS volumes without the need for you to build, maintain, and secure your own key management infrastructure. When you create an encrypted EBS volume and attach it to a supported instance type, the following types of data are encrypted:
- Data at rest inside the volume
- All data moving between the volume and the instance
- All snapshots created from the volume
Snapshots that are taken from encrypted volumes are automatically encrypted. Volumes that are created from encrypted snapshots are also automatically encrypted.
A tag is a label that you or AWS assigns to an AWS resource. Each tag consists of a key and a value. A key can have more than one value. You can use tags to organize your resources, and cost allocation tags to track your AWS costs on a detailed level. After you activate cost allocation tags, AWS uses the cost allocation tags to organize your resource costs on your cost allocation report, to make it easier for you to categorize and track your AWS costs. AWS provides two types of cost allocation tags, an AWS-generated tag and user-defined tags. AWS defines, creates, and applies the AWS-generated tag for you, and you define, create, and apply user-defined tags. You must activate both types of tags separately before they can appear in Cost Explorer or on a cost allocation report.
You can monitor the progress of a stack update by viewing the stack’s events. The console’s Events tab displays each major step in the creation and update of the stack, sorted by the time of each event with the latest events on top. The start of the stack update process is marked with an UPDATE_IN_PROGRESS event for the stack. For more information on monitoring your stack, please visit the URL below: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-monitor-stack.html
A placement group is a logical grouping of instances within a single Availability Zone. Placement groups are recommended for applications that benefit from low network latency, high network throughput, or both. To provide the lowest latency, and the highest packet-per-second network performance for your placement group, choose an instance type that supports enhanced networking. For more information on Placement Groups, please visit the below URL: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html
AWS CloudTrail is an AWS service that helps you enable governance, compliance, and operational and risk auditing of your AWS account. Actions taken by a user, role, or an AWS service are recorded as events in CloudTrail. Events include actions taken in the AWS Management Console, AWS Command Line Interface, and AWS SDKs and APIs. Visibility into your AWS account activity is a key aspect of security and operational best practices. You can use CloudTrail to view, search, download, archive, analyze, and respond to account activity across your AWS infrastructure. You can identify who or what took which action, what resources were acted upon, when the event occurred, and other details to help you analyze and respond to activity in your AWS account. For more information on Cloudtrail, please visit the below URL: http://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html
Custom resources enable you to write custom provisioning logic in templates that AWS CloudFormation runs anytime you create, update (if you changed the custom resource), or delete stacks. For example, you might want to include resources that aren’t available as AWS CloudFormation resource types. You can include those resources by using custom resources. That way you can still manage all your related resources in a single stack. Use the AWS::CloudFormation::CustomResource or Custom::String resource type to define custom resources in your templates. Custom resources require one property: the service token, which specifies where AWS CloudFormation sends requests to, such as an Amazon SNS topic.
Failover routing lets you route traffic to a resource when the resource is healthy or to a different resource when the first resource is unhealthy. The primary and secondary resource record sets can route traffic to anything from an Amazon S3 bucket that is configured as a website to a complex tree of records. For more information on Route53 Failover Routing, please visit the below URL: http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy.html
Deployment Types:
- Single Target Deployment - small dev projects, legacy or non-HA infrastructure; outage occurs in case of failure, testing opportunity is limited.
- All-at-Once Deployment - deployment happens on multiple targets at the same time, requires orchestration tools, suitable for non-critical apps in the 5-10 target range.
- Minimum in-service Deployment - keeps a minimum number of targets in service and deploys in multiple stages, suitable for large environments, allows automated testing, no downtime.
- Rolling Deployments - x targets per stage, happens in multiple stages, the next stage begins only after the previous stage completes, orchestration and health checks required, can be least efficient if x is small, allows automated testing, no downtime as long as x is not large enough to impact the application, can be paused, allowing multi-version testing.
- Blue Green Deployment - deploy to a separate Green environment, update the code on Green, extra cost due to the duplicate environment during deployment, deployment is rapid, cutover and migration are clean (DNS change), rollback is easy (DNS regression), can be fully automated using CFN etc. Binary, no traffic split, not used for feature testing.
- A/B Testing - distributes traffic between blue and green, allows gradual performance/stability/health analysis, allows new feature testing, rollback is quick, the end goal of A/B testing is not migration, uses Route 53 for DNS resolution with 2 records, one pointing to A and the other pointing to B, weighted/round robin.
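The staging behaviour of a rolling deployment can be sketched in a few lines of Python. This is only an illustration of the idea: `rolling_stages`, `rolling_deploy`, `deploy` and `is_healthy` are hypothetical names, not part of any AWS API.

```python
def rolling_stages(targets, x):
    """Split targets into consecutive stages of at most x targets each."""
    if x < 1:
        raise ValueError("x must be at least 1")
    return [targets[i:i + x] for i in range(0, len(targets), x)]


def rolling_deploy(targets, x, deploy, is_healthy):
    """Deploy stage by stage and stop at the first unhealthy stage.

    Returns the targets that were updated and passed their health check.
    """
    updated = []
    for stage in rolling_stages(targets, x):
        for target in stage:
            deploy(target)
        if not all(is_healthy(target) for target in stage):
            break  # pause here; the remaining targets still run the old version
        updated.extend(stage)
    return updated
```

A smaller x means more stages and a slower rollout, but fewer targets are out of service at any one time.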
Intrinsic & Conditional Functions
- Intrinsic Fn - built-in functions provided by AWS to help manage, reference, and conditionally act upon resources, situations & inputs to a stack.
- Fn::Base64 - Base64 encoding for User Data
- Fn::FindInMap - Mapping lookup
- Fn::GetAtt - Advanced reference look up
- Fn::GetAZs - retrieve list of AZs in a region
- Fn::Join - construct complex strings; concatenate strings
- Fn::Select - value selection from a list by index (0, 1, ...)
- Ref - default value of resource
- Conditional Functions - Fn::And, Fn::Equals, Fn::If, Fn::Not, Fn::Or
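As a rough illustration, the template fragment below uses several of these functions together. It is written as a Python dict and serialised to JSON; the parameter, condition and resource names (`EnvType`, `IsProd`, `DataVolume`) are made up for this example.

```python
import json

template = {
    "Parameters": {"EnvType": {"Type": "String", "Default": "test"}},
    "Conditions": {
        # True only when the EnvType parameter equals "prod"
        "IsProd": {"Fn::Equals": [{"Ref": "EnvType"}, "prod"]}
    },
    "Resources": {
        "DataVolume": {
            "Type": "AWS::EC2::Volume",
            "Properties": {
                # Larger volume in production, smaller one elsewhere
                "Size": {"Fn::If": ["IsProd", 100, 10]},
                # First AZ of the region the stack is created in
                "AvailabilityZone": {"Fn::Select": ["0", {"Fn::GetAZs": ""}]},
            },
        }
    },
    "Outputs": {
        "VolumeLabel": {
            "Value": {"Fn::Join": ["-", [{"Ref": "EnvType"}, {"Ref": "DataVolume"}]]}
        }
    },
}

template_body = json.dumps(template, indent=2)
print(template_body)
```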
CFN Resource Deletion Policies
- A policy/setting which is associated with each resource in a template; A way to control what happens to each resource when a stack is deleted.
- Policy value - Delete (Default), Retain, Snapshot
- Delete - useful for testing environments, CI/CD/QA workflows, presales, short lifecycle/immutable environments.
- Retain - resources live beyond the lifecycle of the stack; Windows Server Platform (AD), servers with state, SQL, Exchange, file servers, non-immutable architectures.
- Snapshot - restricted policy type only available for resources that support snapshots, such as EBS volumes, RDS DB instances and Redshift clusters; takes a snapshot before deleting so that data can be recovered.
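A minimal sketch of how the three policy values appear in a template, again as a Python dict. The resource names are hypothetical, and `resources_surviving_delete` is only a helper for this example, not a CloudFormation feature.

```python
template = {
    "Resources": {
        # Removed together with the stack (also the default when no policy is set)
        "ScratchBucket": {"Type": "AWS::S3::Bucket", "DeletionPolicy": "Delete"},
        # Left in the account after the stack is deleted
        "AuditBucket": {"Type": "AWS::S3::Bucket", "DeletionPolicy": "Retain"},
        # Snapshotted first, then deleted
        "DataVolume": {
            "Type": "AWS::EC2::Volume",
            "DeletionPolicy": "Snapshot",
            "Properties": {
                "Size": 10,
                "AvailabilityZone": {"Fn::Select": ["0", {"Fn::GetAZs": ""}]},
            },
        },
    }
}


def resources_surviving_delete(tpl):
    """Return logical IDs of resources that remain after stack deletion."""
    return sorted(
        name
        for name, resource in tpl["Resources"].items()
        if resource.get("DeletionPolicy", "Delete") == "Retain"
    )


print(resources_surviving_delete(template))
```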
Immutable Architecture - replace infrastructure instead of upgrading or repairing faulty components, treat servers as unchangeable objects, don't diagnose and fix, throw away and re-create, nothing bootstrapped except the AMI.
CFN Stack updates
- During an update the stack policy is checked and updates can be prevented; in the absence of a stack policy, all updates are allowed; a stack policy cannot be deleted once applied. Once a stack policy is applied, ALL resources are protected and updates are denied by default; to override the default Deny, an explicit Allow is required; a statement can target a single resource (logical ID), a wildcard, or NotResource; each statement has a Principal and an Action; a Condition element (resource type) can also be used.
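A stack policy along these lines might look like the sketch below: an explicit Allow for everything, followed by a Deny that protects one resource from replacement or deletion. The logical ID `ProductionDatabase` and the stack name in the comment are hypothetical.

```python
import json

stack_policy = {
    "Statement": [
        {
            # Explicit Allow lifts the default Deny for all resources
            "Effect": "Allow",
            "Action": "Update:*",
            "Principal": "*",
            "Resource": "*",
        },
        {
            # Deny wins over Allow for this one resource
            "Effect": "Deny",
            "Action": ["Update:Replace", "Update:Delete"],
            "Principal": "*",
            "Resource": "LogicalResourceId/ProductionDatabase",
        },
    ]
}

policy_body = json.dumps(stack_policy)

# With boto3 the policy could then be attached roughly like this:
#   boto3.client("cloudformation").set_stack_policy(
#       StackName="my-stack", StackPolicyBody=policy_body)
```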
Stack updates: 4 Types - Update with No Interruption, Some Interruption, Replacement, Delete

Sagemaker Model deployment and Integration

Amit Kayal — Wed, 18 Jan 2023 07:29:11 +0000

Sagemaker Model deployment and Integration

AWS Feature store

SageMaker Feature Store is a purpose-built solution for ML feature management. It helps data science teams reuse ML features across teams and models, serve features for model predictions at scale with low latency, and train and deploy new models more quickly and effectively.

Refer to the notebook https://github.com/aws-samples/ml-lineage-helper/blob/main/examples/example.ipynb for more details.

Why is feature lineage important?

Imagine trying to manually track all of this for a large team, multiple teams, or even multiple business units. Lineage tracking and querying helps make this more manageable and helps organizations move to ML at scale. The following are four examples of how feature lineage helps scale the ML process:

Build confidence for reuse of existing features
Avoid reinventing features that are based on the same raw data as existing features
Troubleshoot and audit models and model predictions
Manage features proactively

AWS ML Lens and built-in models

Deployment Options

ML inference can be done in real time on individual records, such as with a REST API endpoint. Inference can also be done in batch mode as a processing job on a large dataset. While both approaches push data through a model, each has its own target goal when running inference at scale.

Attribute	Real Time	Micro Batch	Batch
Execution Mode	Synchronous	Synchronous/Asynchronous	Asynchronous
Prediction Latency	Subsecond	Seconds to minutes	Indefinite
Data Bounds	Unbounded/stream	Bounded	Bounded
Execution Frequency	Variable	Variable	Variable/fixed
Invocation Mode	Continuous stream/API calls	Event-based	Event-based/scheduled
Examples	Real-time REST API endpoint	Data analyst running a SQL UDF	Scheduled inference job

Realtime deployment

SageMaker real-time deployment hosts the model behind an HTTPS endpoint. The key point is that the inference pipeline can be coupled with autoscaling.

There are several ways to deploy a real-time endpoint with SageMaker, ranging from a prebuilt container to your own model and your own container.

With a SageMaker prebuilt container, we supply the trained model artifact and, optionally, our own inference script.

Quite often we add our own inference script, and this is fairly simple.
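A custom inference script for the prebuilt framework containers (for example scikit-learn or PyTorch) usually defines the hooks `model_fn`, `input_fn`, `predict_fn` and `output_fn`. The sketch below follows that convention with a toy model; `ThresholdModel` and the JSON payload shape are assumptions made for this example, and a real script would load the trained artifact from `model_dir`.

```python
import json


class ThresholdModel:
    """Hypothetical stand-in for a trained model."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        return [1 if sum(row) > self.threshold else 0 for row in rows]


def model_fn(model_dir):
    """Load the model once, when the container starts."""
    return ThresholdModel(threshold=1.0)


def input_fn(request_body, content_type):
    """Deserialize the request payload into model input."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)["instances"]


def predict_fn(data, model):
    """Run the prediction on the deserialized input."""
    return model.predict(data)


def output_fn(prediction, accept):
    """Serialize the prediction into the response body."""
    return json.dumps({"predictions": prediction})
```

The same four functions can be exercised locally before deployment by calling them in order on a sample request body.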

It is not rare to bring our own container and our own trained model along with an inference script. The architecture does not change in that case; the same deployment flow applies.

Autoscale

We can set an autoscaling policy for a SageMaker endpoint so that it scales out and scales in automatically.

The policy is registered through Application Auto Scaling. ServiceNamespace is set to sagemaker and ResourceId identifies the endpoint variant, in the form endpoint/<endpoint-name>/variant/<variant-name>.
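A hedged sketch of that setup with boto3's Application Auto Scaling client follows. The variant name `AllTraffic`, the capacity limits, the target value and the policy name are illustrative assumptions, not values from the original setup.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_capacity=1, max_capacity=4,
                               target_invocations=70.0):
    """Build the two request payloads needed to autoscale an endpoint variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    scalable_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    scaling_policy = {
        "PolicyName": f"{endpoint_name}-invocations-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return scalable_target, scaling_policy


def apply_autoscaling(endpoint_name):
    """Register the scalable target and attach the policy (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("application-autoscaling")
    scalable_target, scaling_policy = build_autoscaling_requests(endpoint_name)
    client.register_scalable_target(**scalable_target)
    client.put_scaling_policy(**scaling_policy)
```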

Multi-Model endpoint

SageMaker multi-model endpoints work with several frameworks, such as TensorFlow, PyTorch, MXNet, and sklearn, and you can build your own container with a multi-model server. Multi-model endpoints are also supported natively in the following popular SageMaker built-in algorithms: XGBoost, Linear Learner, Random Cut Forest (RCF), and K-Nearest Neighbors (KNN).

Refer to the notebook https://github.com/aws-samples/sagemaker-multi-model-endpoint-tensorflow-computer-vision/blob/main/multi-model-endpoint-tensorflow-cv.ipynb to understand how we can deploy this. Also refer to the blog https://aws.amazon.com/blogs/machine-learning/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints/

All of the models that are hosted on a multi-model endpoint must share the same serving container image.
Multi-model endpoints are an option that can improve endpoint utilization when your models are of similar size, share the same container image, and have similar invocation latency requirements.
All the models must store their artifacts under the same S3 prefix.

Cost advantages

Consider running 10 models on a single multi-model endpoint versus using 10 separate endpoints. In the AWS blog example referenced above, this results in savings of $3,000 per month. Multi-model endpoints can easily scale to hundreds or thousands of models.

How to use?

To create a multi-model endpoint in Amazon SageMaker, choose the multi-model option, provide the inference serving container image path, and provide the Amazon S3 prefix in which the trained model artifacts are stored. You can organize your models in S3 any way you wish, so long as they all use the same prefix.
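When the endpoint is created through the API rather than the console, the multi-model option corresponds to a container definition with `Mode` set to `MultiModel` and `ModelDataUrl` pointing at the shared S3 prefix. A minimal sketch, where the image URI, bucket and prefix are placeholders:

```python
def multi_model_container(image_uri, model_prefix):
    """Build a container definition for a multi-model endpoint."""
    if not model_prefix.startswith("s3://"):
        raise ValueError("model_prefix must be an S3 URI")
    if not model_prefix.endswith("/"):
        model_prefix += "/"  # the URL is a prefix, not a single artifact
    return {
        "Image": image_uri,
        "ModelDataUrl": model_prefix,
        "Mode": "MultiModel",
    }


container = multi_model_container(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-serving-image:latest",  # placeholder
    "s3://my-bucket/models",  # placeholder
)

# With boto3 this could be passed on roughly as:
#   boto3.client("sagemaker").create_model(
#       ModelName="my-multi-model", ExecutionRoleArn=role_arn,
#       PrimaryContainer=container)
```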

When you invoke the multi-model endpoint, you provide the relative path of a specific model with the new TargetModel parameter of InvokeEndpoint. To add models to the multi-model endpoint, simply store a newly trained model artifact in S3 under the prefix associated with the endpoint. The model will then be immediately available for invocations.

To update a model already in use, add the model to S3 with a new name and begin invoking the endpoint with the new model name. To stop using a model deployed on a multi-model endpoint, stop invoking the model and delete it from S3.
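Invoking a specific model then comes down to setting `TargetModel` on the request. The sketch below only builds the request; the endpoint name, artifact names and payload are assumptions for illustration.

```python
import json


def build_invoke_request(endpoint_name, target_model, payload):
    """Build InvokeEndpoint arguments for one model on a multi-model endpoint."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        # Path of the artifact relative to the endpoint's S3 prefix
        "TargetModel": target_model,
        "Body": json.dumps(payload),
    }


request_v1 = build_invoke_request("demo-mme", "churn-v1.tar.gz", {"inputs": [1, 2, 3]})
# Rolling out an updated model: upload it under a new name and switch the target
request_v2 = build_invoke_request("demo-mme", "churn-v2.tar.gz", {"inputs": [1, 2, 3]})

# With boto3 the call itself would look roughly like:
#   boto3.client("sagemaker-runtime").invoke_endpoint(**request_v2)
```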

Instead of downloading all the models into the container from S3 when the endpoint is created, Amazon SageMaker multi-model endpoints dynamically load models from S3 when invoked. As a result, an initial invocation to a model might see higher inference latency than the subsequent inferences, which are completed with low latency. If the model is already loaded on the container when invoked, then the download step is skipped and the model returns the inferences with low latency.
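The load-on-first-use behaviour can be pictured with a toy cache. This is a conceptual simulation only, not how SageMaker is implemented; `LazyModelCache` and its loader are hypothetical.

```python
class LazyModelCache:
    """Toy stand-in for a multi-model container: loads a model on first use."""

    def __init__(self, loader):
        self._loader = loader  # callable that "downloads and loads" a model
        self._loaded = {}
        self.requests = 0
        self.hits = 0

    def invoke(self, target_model, payload):
        self.requests += 1
        if target_model in self._loaded:
            self.hits += 1  # warm path: model already in memory
        else:
            # cold path: the first invocation pays the download/load cost
            self._loaded[target_model] = self._loader(target_model)
        return self._loaded[target_model](payload)

    @property
    def cache_hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0


cache = LazyModelCache(lambda name: (lambda x: f"{name}:{x}"))
cache.invoke("a.tar.gz", 1)  # cold
cache.invoke("a.tar.gz", 2)  # warm
cache.invoke("b.tar.gz", 3)  # cold
print(cache.cache_hit_ratio)
```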

Monitoring multi-model endpoints using Amazon CloudWatch metrics

To make price and performance tradeoffs, you will want to test multi-model endpoints with models and representative traffic from your own application. Amazon SageMaker provides additional metrics in CloudWatch for multi-model endpoints so you can determine the endpoint usage and the cache hit rate and optimize your endpoint. The metrics are as follows:

ModelLoadingWaitTime – The interval of time that an invocation request waits for the target model to be downloaded or loaded to perform the inference.
ModelUnloadingTime – The interval of time that it takes to unload the model through the container’s UnloadModel API call.
ModelDownloadingTime – The interval of time that it takes to download the model from S3.
ModelLoadingTime – The interval of time that it takes to load the model through the container’s LoadModel API call.
ModelCacheHit – The number of InvokeEndpoint requests sent to the endpoint where the model was already loaded. Taking the Average statistic shows the ratio of requests in which the model was already loaded.
LoadedModelCount – The number of models loaded in the containers in the endpoint. This metric is emitted per instance. The Average statistic with a period of 1 minute tells you the average number of models loaded per instance, and the Sum statistic tells you the total number of models loaded across all instances in the endpoint. The models that this metric tracks are not necessarily unique because you can load a model in multiple containers in the endpoint.

You can use CloudWatch charts to help make ongoing decisions on the optimal choice of instance type, instance count, and number of models that a given endpoint should host.
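One way to pull the cache hit rate programmatically is CloudWatch's `get_metric_statistics`. The sketch below builds the query; the `AWS/SageMaker` namespace and the `EndpointName`/`VariantName` dimensions are stated here as assumptions to verify against your account, and the endpoint name is a placeholder.

```python
from datetime import datetime, timedelta, timezone


def cache_hit_query(endpoint_name, variant_name="AllTraffic", hours=1, now=None):
    """Build a GetMetricStatistics query for the ModelCacheHit metric."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",  # assumed namespace
        "MetricName": "ModelCacheHit",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,
        # Average gives the ratio of requests that found the model loaded
        "Statistics": ["Average"],
    }


query = cache_hit_query("demo-mme", now=datetime(2023, 1, 18, 8, 0, tzinfo=timezone.utc))

# With boto3:
#   datapoints = boto3.client("cloudwatch").get_metric_statistics(**query)["Datapoints"]
```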

Inference Pipeline sagemaker

You can use trained models in an inference pipeline to make real-time predictions directly without performing external preprocessing. When you configure the pipeline, you can choose to use the built-in feature transformers already available in Amazon SageMaker. Or, you can implement your own transformation logic using just a few lines of scikit-learn or Spark code.

Refer to https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.html or https://catalog.us-east-1.prod.workshops.aws/workshops/f238037c-8f0b-446e-9c15-ebcc4908901a/en-US/002-services/003-machine-learning/020-sagemaker for more details.

An inference pipeline allows you to host multiple models behind a single endpoint. In this case, however, the models form a sequential chain of the steps that are required for inference. This allows you to take your data transformation model, your predictor model, and your post-processing transformer, and host them so they can be run sequentially behind a single endpoint.
- The inference request comes into the endpoint, then the first model is invoked, and that model is your data transformation.
- The output of that model is then passed to the next step, which is your predictor model, for example an XGBoost model.
- That output is then passed to the next step, and the final step in the pipeline provides the final, post-processed response to the inference request.
- This allows you to couple your pre- and post-processing code behind the same endpoint and helps ensure that your training and your inference code stay synchronized.
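The sequential hand-off between containers can be pictured as plain function composition. The sketch below is conceptual only and does not use the SageMaker SDK; the three step functions are made-up stand-ins for the transformation, predictor and post-processing containers.

```python
def run_pipeline(steps, request):
    """Pass the request through each step; every output feeds the next step."""
    data = request
    for step in steps:
        data = step(data)
    return data


def scale_features(record):
    """Stand-in for the data transformation model."""
    return [value / 10.0 for value in record]


def predict(features):
    """Stand-in for the predictor model."""
    return sum(features)


def to_label(score):
    """Stand-in for the post-processing step."""
    return {"label": "high" if score > 1.0 else "low", "score": score}


response = run_pipeline([scale_features, predict, to_label], [5, 10])
print(response)
```

Because the same transformation code sits in front of the predictor at inference time, it cannot drift from what was used during training.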

Sagemaker Production Variant

Amazon SageMaker enables you to test multiple models or model versions behind the same endpoint using production variants. Each production variant identifies a machine learning (ML) model and the resources deployed for hosting the model. By using production variants, you can test ML models that have been trained using different datasets, trained using different algorithms and ML frameworks, or deployed to different instance types, or any combination of these. You can distribute endpoint invocation requests across multiple production variants by providing the traffic distribution for each variant, or you can invoke a specific variant directly for each request. In this topic, we look at both methods for testing ML models.

Refer to the notebook https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_endpoints/a_b_testing/a_b_testing.html for implementation details.

Test models by specifying traffic distribution

Specify the percentage of the traffic that gets routed to each model by specifying the weight for each production variant in the endpoint configuration.
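As a small sketch, the variants below carry weights of 3 and 1, which works out to a 75/25 traffic split because each variant receives its weight divided by the sum of all weights. Variant names, model names and instance types are placeholders.

```python
production_variants = [
    {
        "VariantName": "VariantA",
        "ModelName": "model-a",  # placeholder
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 3.0,
    },
    {
        "VariantName": "VariantB",
        "ModelName": "model-b",  # placeholder
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    },
]


def traffic_share(variants):
    """Return the fraction of traffic each variant receives."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


print(traffic_share(production_variants))

# With boto3 the list is passed as ProductionVariants, roughly:
#   boto3.client("sagemaker").create_endpoint_config(
#       EndpointConfigName="ab-test-config", ProductionVariants=production_variants)
```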

Test models by invoking specific variants

Specify the specific version of the model you want to invoke by providing a value for the TargetVariant parameter when you call InvokeEndpoint.
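A request that bypasses the weighted routing might be built as in this sketch; the endpoint and variant names are placeholders.

```python
import json


def build_variant_request(endpoint_name, variant_name, payload):
    """Build InvokeEndpoint arguments that pin the request to one variant."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "TargetVariant": variant_name,  # overrides the traffic distribution
        "Body": json.dumps(payload),
    }


variant_request = build_variant_request("ab-test-endpoint", "VariantB", {"inputs": "hello"})

# With boto3:
#   boto3.client("sagemaker-runtime").invoke_endpoint(**variant_request)
```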

Amazon SageMaker Batch Transform: Batch Inference

We’ll use the Sagemaker Batch Transform Jobs and a trained machine learning model. It is assumed that we have already trained the model, pushed the Docker image to ECR, and registered the model in Sagemaker.

We need the identifier of the SageMaker model we want to use and the location of the input data.
We can either use a built-in container for the inference image or bring our own.
Batch Transform partitions the Amazon S3 objects in the input by key and maps Amazon S3 objects to instances. When you have multiple files, one instance might process input1.csv, and another instance might process input2.csv.

In Batch Transform you provide your inference data as an S3 URI, and SageMaker takes care of downloading it, running the prediction, and uploading the results to S3 afterwards. More details are available in the Batch Transform documentation.

If you trained a model using the Hugging Face Estimator, call the transformer() method to create a transform job for a model based on the training job. Refer to https://huggingface.co/docs/sagemaker/inference for more details.

Note:

The batch job, created with transformer(), takes the instance count and the instance type.
The transform job, started with transform(), takes the data location and the content type.

batch_job = huggingface_estimator.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord')


batch_job.transform(
    data='s3://s3-uri-to-batch-data',
    content_type='application/json',    
    split_type='Line')

If you want to run your batch transform job later or with a model from the 🤗 Hub, create a HuggingFaceModel instance and then call the transformer() method:

from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration <https://huggingface.co/models>
hub = {
    'HF_MODEL_ID':'distilbert-base-uncased-finetuned-sst-2-english',
    'HF_TASK':'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                                                # configuration for loading model from Hub
   role=role,                                              # IAM role with permissions to create an endpoint
   transformers_version="4.6",                             # Transformers version used
   pytorch_version="1.7",                                  # PyTorch version used
   py_version='py36',                                      # Python version used
)

# create transformer to run a batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path=output_s3_path, # we are using the same s3 path to save the output with the input
    strategy='SingleRecord'
)

# starts batch transform job and uses S3 data as input
batch_job.transform(
    data='s3://sagemaker-s3-demo-test/samples/input.jsonl',
    content_type='application/json',    
    split_type='Line'
)

The input.jsonl looks like this:

{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
{"inputs":"SageMaker is pretty cool"}

Once the job has finished, the result file can be downloaded from S3 and parsed:

from ast import literal_eval
from sagemaker.s3 import S3Downloader, s3_path_join

# dataset_jsonl_file (name of the input file) and output_s3_path are assumed to be defined earlier
# creating s3 uri for result file -> input file + .out
output_file = f"{dataset_jsonl_file}.out"
output_path = s3_path_join(output_s3_path, output_file)

# download file
S3Downloader.download(output_path, '.')

batch_transform_result = []
with open(output_file) as f:
    for line in f:
        # converts jsonline array to normal array
        line = "[" + line.replace("[", "").replace("]", ",") + "]"
        # extend instead of overwrite so results from every line are kept
        batch_transform_result.extend(literal_eval(line))

# print results
print(batch_transform_result[:3])

📓 Open the notebook for an example of how to run a batch transform job for inference.

Speeding up the processing

We have only one instance running, so processing the entire file may take some time. We can increase the number of instances using the instance_count parameter to speed it up. We can also send multiple requests to the Docker container simultaneously. To configure concurrent transformations, we must use the max_concurrent_transforms parameter.
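A small sketch of how the two settings could be combined when creating the transformer. The helper `transformer_settings` and the default values are hypothetical; only `instance_count`, `instance_type`, `strategy` and `max_concurrent_transforms` are parameters of the SDK's transformer() method.

```python
def transformer_settings(instance_count=2, max_concurrent_transforms=4):
    """Build keyword arguments for transformer() with parallelism enabled."""
    if instance_count < 1 or max_concurrent_transforms < 1:
        raise ValueError("both values must be at least 1")
    return {
        "instance_count": instance_count,  # instances sharing the input files
        "instance_type": "ml.p3.2xlarge",
        "strategy": "SingleRecord",
        # parallel requests sent to each container
        "max_concurrent_transforms": max_concurrent_transforms,
    }


settings = transformer_settings()

# Assuming a HuggingFaceModel instance is available as huggingface_model:
#   batch_job = huggingface_model.transformer(**settings)
```

Adding instances only helps when the input is split across several S3 objects, since objects are mapped to instances by key.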

Processing the output

In the end, we must get access to the output. We’ll find the output files in the location specified in the Transformer constructor. Every line contains the prediction and the input parameters.