<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sauveer Ketan</title>
    <description>The latest articles on DEV Community by Sauveer Ketan (@sauveer_ketan).</description>
    <link>https://dev.to/sauveer_ketan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3351238%2Fa7591faf-fabf-4d37-9d94-11ee94e06c84.jpg</url>
      <title>DEV Community: Sauveer Ketan</title>
      <link>https://dev.to/sauveer_ketan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sauveer_ketan"/>
    <language>en</language>
    <item>
      <title>Your Lambda Has All the Right Permissions — So Why Can't It Reach DynamoDB?</title>
      <dc:creator>Sauveer Ketan</dc:creator>
      <pubDate>Wed, 25 Mar 2026 13:01:46 +0000</pubDate>
      <link>https://dev.to/sauveer_ketan/your-lambda-has-all-the-right-permissions-so-why-cant-it-reach-dynamodb-3jn1</link>
      <guid>https://dev.to/sauveer_ketan/your-lambda-has-all-the-right-permissions-so-why-cant-it-reach-dynamodb-3jn1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A real-world VPC + Lambda DNS puzzle that'll make you double-check your security groups forever.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🧩 The Scenario
&lt;/h2&gt;

&lt;p&gt;Your team deploys a Lambda function that needs to read and write to a DynamoDB table. The setup looks solid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Lambda is &lt;strong&gt;attached to a VPC&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ A &lt;strong&gt;VPC Endpoint for DynamoDB&lt;/strong&gt; is configured (Gateway type)&lt;/li&gt;
&lt;li&gt;✅ Lambda's &lt;strong&gt;IAM role&lt;/strong&gt; has the correct DynamoDB permissions (&lt;code&gt;dynamodb:GetItem&lt;/code&gt;, &lt;code&gt;dynamodb:PutItem&lt;/code&gt;, etc.)&lt;/li&gt;
&lt;li&gt;✅ The Lambda's &lt;strong&gt;Security Group&lt;/strong&gt; has an outbound rule: &lt;code&gt;All TCP → 0.0.0.0/0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;DNS resolution&lt;/strong&gt; and &lt;strong&gt;DNS hostnames&lt;/strong&gt; are enabled on the VPC&lt;/li&gt;
&lt;li&gt;✅ An &lt;strong&gt;EC2 instance in the same VPC&lt;/strong&gt; can successfully resolve the DynamoDB DNS name AND access the table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yet, Lambda invocations keep failing with a connection/timeout error. It simply &lt;strong&gt;cannot resolve the DynamoDB DNS name&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🤔 Take a Moment — What Do You Think Is Wrong?
&lt;/h2&gt;

&lt;p&gt;You've verified IAM. You've verified the VPC endpoint. DNS is enabled at the VPC level. Even an EC2 in the same VPC works perfectly.&lt;/p&gt;

&lt;p&gt;It's not an issue of the Lambda sitting in a private subnet. It's not a missing route table entry.&lt;/p&gt;

&lt;p&gt;Seriously — take 30 seconds. What's different between how an EC2 instance handles DNS vs. a Lambda function locked down by a Security Group?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
...
...
...
...
...
Think about the OSI model. 🤔
...
...
...
...
...
...
What protocol does DNS primarily use?
...
...
...
...
...
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  💡 The Answer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Lambda's Security Group was missing an outbound rule for UDP traffic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Specifically, this rule was absent:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Port Range&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Custom UDP&lt;/td&gt;
&lt;td&gt;UDP&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;0.0.0.0/0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Or more broadly (to cover both TCP and UDP for DNS):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Port Range&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;All traffic&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;0.0.0.0/0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🔬 Why This Happens — The Root Cause Explained
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DNS Uses UDP (Primarily)
&lt;/h3&gt;

&lt;p&gt;DNS queries almost always go out over &lt;strong&gt;UDP port 53&lt;/strong&gt;. TCP port 53 is typically used only for larger exchanges (zone transfers, or retries when a UDP response is truncated at the classic 512-byte limit). For typical hostname resolution — like resolving &lt;code&gt;dynamodb.us-east-1.amazonaws.com&lt;/code&gt; — &lt;strong&gt;UDP is the default protocol&lt;/strong&gt;.&lt;/p&gt;
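To see just how small these queries are, here is an offline sketch of the DNS wire format (no network involved; the header values are simplified for illustration):

```python
import struct

def build_dns_query(hostname):
    """Encode a minimal DNS A-record query: the payload of one UDP datagram."""
    # Header: query ID, flags (recursion desired), 1 question, 0 other records
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # Name: each dot-separated label prefixed with its length, then a zero byte
    name = b"".join(bytes([len(p)]) + p.encode() for p in hostname.split(".")) + b"\x00"
    question = name + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

packet = build_dns_query("dynamodb.us-east-1.amazonaws.com")
print(len(packet))  # prints 50 -- tiny, which is why UDP is the default transport
```

A whole query fits in one small datagram, so resolvers fire it off over UDP 53 and only fall back to TCP when they have to.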

&lt;p&gt;The Security Group had:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Outbound: All TCP → 0.0.0.0/0   ✅
Outbound: All UDP → 0.0.0.0/0   ❌ MISSING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So when Lambda tried to resolve the DynamoDB endpoint DNS name before establishing a connection, the UDP DNS query was &lt;strong&gt;silently dropped&lt;/strong&gt; by the Security Group. No DNS resolution = no connection = timeout.&lt;/p&gt;
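This gap can be caught programmatically. The sketch below scans the IpPermissionsEgress list that EC2's describe_security_groups returns; the helper function and the sample data are illustrative, not part of any AWS SDK:

```python
def allows_udp_dns(egress_rules):
    """True if any egress rule permits UDP port 53 (or all traffic)."""
    for rule in egress_rules:
        proto = rule.get("IpProtocol")
        if proto == "-1":  # "-1" means all protocols in the EC2 API
            return True
        if proto == "udp" and 53 in range(rule.get("FromPort", 1),
                                          rule.get("ToPort", 0) + 1):
            return True
    return False

# Same shape as boto3's describe_security_groups response
tcp_only = [{"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]
print(allows_udp_dns(tcp_only))  # False: the exact gap from this article
```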

&lt;h3&gt;
  
  
  Why Did EC2 Work But Lambda Didn't?
&lt;/h3&gt;

&lt;p&gt;This is the clever part. The EC2 instance likely had a broader Security Group — perhaps &lt;code&gt;All traffic → 0.0.0.0/0&lt;/code&gt; — or one that explicitly included UDP. Since nobody usually thinks about restricting outbound on EC2 instances during testing, it just worked.&lt;/p&gt;

&lt;p&gt;Lambda, being more of a "locked-down by default" environment in VPCs, often gets a tighter security group — and &lt;strong&gt;TCP-only outbound is a very common mistake&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Didn't the VPC DNS Setting Matter Here?
&lt;/h3&gt;

&lt;p&gt;The VPC's DNS resolution setting (&lt;code&gt;enableDnsSupport&lt;/code&gt;) controls whether instances &lt;em&gt;can use&lt;/em&gt; the Route 53 Resolver (at the VPC base IP + 2, e.g., &lt;code&gt;10.0.0.2&lt;/code&gt;). But that resolver is still queried over &lt;strong&gt;UDP port 53&lt;/strong&gt; from the Lambda's network interface. If the Security Group blocks outbound UDP, the Lambda ENI (Elastic Network Interface) can never reach the resolver — regardless of the VPC-level DNS settings.&lt;/p&gt;
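If you want to compute the resolver address for your own VPC, the standard library can do it; a minimal sketch:

```python
import ipaddress

def route53_resolver_ip(vpc_cidr):
    """The Route 53 Resolver listens at the VPC network address plus two."""
    return str(ipaddress.ip_network(vpc_cidr).network_address + 2)

print(route53_resolver_ip("10.0.0.0/16"))  # prints 10.0.0.2
```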




&lt;h2&gt;
  
  
  ✅ The Fix
&lt;/h2&gt;

&lt;p&gt;Update the Lambda's Security Group outbound rules to include UDP:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 1 — Minimal / Precise (recommended for production):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Outbound&lt;/td&gt;
&lt;td&gt;Custom UDP&lt;/td&gt;
&lt;td&gt;UDP&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;VPC CIDR or 0.0.0.0/0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outbound&lt;/td&gt;
&lt;td&gt;Custom TCP&lt;/td&gt;
&lt;td&gt;TCP&lt;/td&gt;
&lt;td&gt;443&lt;/td&gt;
&lt;td&gt;0.0.0.0/0 (for HTTPS to DynamoDB endpoint)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Option 2 — Simple / Permissive (fine for dev/test):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Outbound&lt;/td&gt;
&lt;td&gt;All traffic&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;0.0.0.0/0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
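Applied with boto3, Option 1 might look like the following sketch. The group ID is a placeholder, and while authorize_security_group_egress is the real EC2 API call, run it against your own account with care:

```python
# IpPermissions entries in the shape the EC2 API expects:
# DNS over UDP 53, plus HTTPS over TCP 443 for the DynamoDB endpoint.
DNS_UDP = {"IpProtocol": "udp", "FromPort": 53, "ToPort": 53,
           "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
HTTPS_TCP = {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}

def apply_option_1(group_id):
    """Attach both egress rules to the Lambda's security group."""
    import boto3  # needs AWS credentials and EC2 permissions at call time
    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_egress(GroupId=group_id,
                                        IpPermissions=[DNS_UDP, HTTPS_TCP])

# apply_option_1("sg-0123456789abcdef0")  # placeholder group ID
```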




&lt;h2&gt;
  
  
  🗝️ Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;DNS runs on UDP port 53&lt;/strong&gt; — always allow outbound UDP 53 in Lambda security groups when using VPC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"All TCP" ≠ "All Traffic"&lt;/strong&gt; — a very easy mistake to make in the AWS console.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EC2 working ≠ Lambda will work&lt;/strong&gt; — EC2 and Lambda often have different Security Groups with different outbound rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPC Endpoints don't bypass DNS resolution&lt;/strong&gt; — Lambda still needs to resolve the DynamoDB hostname before the VPC endpoint routing kicks in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silence is not golden in Security Groups&lt;/strong&gt; — blocked UDP traffic produces no error, just a timeout, making this particularly tricky to debug.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🛠️ Quick Debugging Checklist for Lambda-in-VPC DNS Issues
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Does the Security Group allow &lt;strong&gt;outbound UDP 53&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;[ ] Does the Security Group allow &lt;strong&gt;outbound TCP 443&lt;/strong&gt; (for AWS service endpoints)?&lt;/li&gt;
&lt;li&gt;[ ] Is DNS Resolution enabled on the VPC?&lt;/li&gt;
&lt;li&gt;[ ] Is DNS Hostnames enabled on the VPC?&lt;/li&gt;
&lt;li&gt;[ ] Is the subnet's route table associated with the Gateway Endpoint? (The prefix-list route is added automatically on association.)&lt;/li&gt;
&lt;li&gt;[ ] Does the VPC Endpoint policy allow the Lambda's IAM role?&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Found this useful? Drop a 💬 comment with your own sneaky AWS networking gotchas — there are plenty of them out there!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more real-world AWS and Linux troubleshooting posts in this series.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>lambda</category>
      <category>vpc</category>
      <category>troubleshooting</category>
    </item>
    <item>
      <title>Strands AI Functions: Write Python Functions in Natural Language</title>
      <dc:creator>Sauveer Ketan</dc:creator>
      <pubDate>Wed, 25 Mar 2026 07:21:45 +0000</pubDate>
      <link>https://dev.to/sauveer_ketan/strands-ai-functions-write-python-functions-in-natural-language-156o</link>
      <guid>https://dev.to/sauveer_ketan/strands-ai-functions-write-python-functions-in-natural-language-156o</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post is written for architects and developers already familiar with Amazon Strands Agents SDK.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AWS's experimental new library lets you write AI-powered Python functions in natural language — and the LLM writes the implementation at runtime.&lt;/p&gt;

&lt;p&gt;Most real-world agentic workflows still require a lot of traditional code. For example, imagine accepting an uploaded invoice file in an unknown format and converting it into a clean, normalized DataFrame for use in the rest of the workflow. With traditional code, you write format-detection logic, transformation pipelines, prompt templates, response parsers, and retry loops — dozens of lines before you've even gotten to the business logic. What if you could just describe what you want and let the model figure out the rest?&lt;/p&gt;

&lt;p&gt;That's exactly what &lt;strong&gt;Strands AI Functions&lt;/strong&gt; is designed to do. Released by AWS as part of the newly launched Strands Labs experimental organization, AI Functions is a Python library that gives developers a disciplined, intent-based approach to build reliable, AI-powered pipelines — without writing traditional prompt orchestration, parsing, and retry logic for the AI components.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"At a high level, AI Functions lets you describe intent, while the framework handles execution, correction, and validation."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Is Strands Labs?
&lt;/h2&gt;

&lt;p&gt;Before diving into AI Functions, a quick note on where it comes from. In early 2026, AWS launched &lt;strong&gt;Strands Labs&lt;/strong&gt; — a separate GitHub organization designed as an innovation sandbox for experimental agentic AI projects. Think of it as the frontier research wing of the Strands Agents SDK.&lt;/p&gt;

&lt;p&gt;Strands Labs launched with three projects: &lt;strong&gt;Robots&lt;/strong&gt; (physical AI agents), &lt;strong&gt;Robots Sim&lt;/strong&gt; (simulation environments), and &lt;strong&gt;AI Functions&lt;/strong&gt; — the one that should immediately catch the attention of any developer or architect building AI-powered pipelines.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI Functions is &lt;strong&gt;experimental&lt;/strong&gt;. Expect breaking changes. It is not yet production-ready, but the concepts are production-relevant today — understanding them now puts you ahead of the curve.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Core Idea: Functions Written in Natural Language
&lt;/h2&gt;

&lt;p&gt;An AI Function looks like a normal Python function decorated with &lt;code&gt;@ai_function&lt;/code&gt;. But instead of writing code in the function body, you write a docstring in natural language that describes what the function should do.&lt;/p&gt;

&lt;p&gt;AI Functions are implemented on top of the Strands Agent runtime. Any valid option of &lt;code&gt;strands.Agent&lt;/code&gt; (such as &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;tools&lt;/code&gt;, &lt;code&gt;system_prompt&lt;/code&gt;) can be passed in the decorator.&lt;/p&gt;

&lt;p&gt;When an AI Function is called, the library will automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Strands agent&lt;/li&gt;
&lt;li&gt;Generate a prompt based on the docstring template and the provided arguments&lt;/li&gt;
&lt;li&gt;Parse and validate the result&lt;/li&gt;
&lt;li&gt;Return it as a typed Python object&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the outside, it behaves like any other Python function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;attendees&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;action_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_meeting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcripts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Write a summary of the following meeting in less than 50 words.
    &amp;lt;transcripts&amp;gt;
    {transcripts}
    &amp;lt;/transcripts&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Call it just like any normal Python function
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarize_meeting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function takes typed inputs, returns a typed Pydantic model, and the library handles everything in between — creating the agent, running the model, parsing the output, and returning a validated Python object. To the rest of your codebase, it's just a function call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Post-Conditions
&lt;/h2&gt;

&lt;p&gt;This is the feature that sets Strands AI Functions apart from every other "just call the LLM" approach. The core philosophy is that you should &lt;strong&gt;never rely on prompt engineering alone&lt;/strong&gt; to guarantee output correctness. Instead, you define &lt;strong&gt;post-conditions&lt;/strong&gt; — validation functions that run after the AI produces its output.&lt;/p&gt;

&lt;p&gt;If a post-condition fails, the library automatically feeds the error back to the agent in a self-correcting loop, up to a configurable number of attempts. Your pipeline either gets a validated result or fails cleanly — no silent garbage output sneaking through.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PostConditionResult&lt;/span&gt;

&lt;span class="c1"&gt;# Standard Python validator
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summary is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; words, must be ≤ 50&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Or use another AI Function as a validator!
&lt;/span&gt;&lt;span class="nd"&gt;@ai_function&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_style&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PostConditionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Check if the summary uses bullet points and provides sufficient context.
    &amp;lt;summary&amp;gt;{response.summary}&amp;lt;/summary&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_conditions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;check_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_style&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_meeting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcripts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Write a concise meeting summary. &amp;lt;transcripts&amp;gt;{transcripts}&amp;lt;/transcripts&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that a post-condition can itself be an AI Function — enabling sophisticated validation of stylistic or semantic constraints that would be impossible to express in pure Python logic.&lt;/p&gt;
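Conceptually, the self-correcting behavior resembles the generic loop below. This is not the library's actual implementation; generate, the validator signature, and the toy model are all illustrative:

```python
def run_with_post_conditions(generate, post_conditions, max_attempts=5):
    """Call generate() until every validator passes, feeding failures back."""
    feedback = None
    for _ in range(max_attempts):
        result = generate(feedback)  # the real library rebuilds a prompt here
        try:
            for check in post_conditions:
                check(result)  # validators raise AssertionError on failure
            return result
        except AssertionError as err:
            feedback = str(err)  # handed back to the model on the next attempt
    raise RuntimeError(f"no valid result after {max_attempts} attempts: {feedback}")

# Toy stand-in for a model: shortens its answer once it sees feedback
def fake_model(feedback):
    return "short answer" if feedback else "word " * 100

def max_five_words(text):
    words = len(text.split())
    if words > 5:
        raise AssertionError(f"{words} words, must be 5 or fewer")

print(run_with_post_conditions(fake_model, [max_five_words]))  # prints short answer
```

The pipeline either returns a result that passed every check or raises after the attempt budget, which matches the "validated result or clean failure" guarantee described above.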




&lt;h2&gt;
  
  
  The Three Pillars of AI Functions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Natural Language Instructions
&lt;/h3&gt;

&lt;p&gt;Describe what you want in plain language, either as a docstring or as a string returned from the function body.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Post-Conditions
&lt;/h3&gt;

&lt;p&gt;Define explicit validation rules that the AI output must satisfy, triggering automatic self-correcting retries.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Python Integration
&lt;/h3&gt;

&lt;p&gt;AI Functions aim to feel like a natural extension of the programming language itself, enabling new kinds of programming patterns and abstractions. They return real Python objects and integrate directly into existing codebases, rather than producing raw text.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Universal Data Loader: A Good Use Case
&lt;/h2&gt;

&lt;p&gt;Consider one of the most compelling examples in the official documentation. You're building a webapp that accepts invoice uploads in any format — JSON, CSV, PDF, SQLite. Normally, you'd write format-detection logic and separate transformation pipelines for each format.&lt;/p&gt;

&lt;p&gt;With AI Functions and &lt;code&gt;code_execution_mode="local"&lt;/code&gt; enabled, the agent inspects the file at runtime, determines the format, writes the appropriate loading and transformation code, and returns a properly-typed Pandas DataFrame — complete with schema validation via a post-condition.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When using a Python executor (&lt;code&gt;code_execution_mode="local"&lt;/code&gt;), all input variables to the AI function are automatically loaded into the Python environment. This means the agent can directly reference and manipulate these variables in the generated code without needing to parse them from the prompt.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;post_conditions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;check_invoice_dataframe&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;code_execution_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;code_executor_additional_imports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;import_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The file `{path}` contains purchase logs. Extract them into a DataFrame
    with columns: product_name (str), quantity (int), price (float),
    purchase_date (datetime).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Works on JSON, CSV, SQLite - the agent figures it out
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;import_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data/invoice.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;import_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data/invoice.sqlite3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, this works best when combined with strong post-conditions and constrained execution environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ Security Note
&lt;/h3&gt;

&lt;p&gt;Right now, Strands AI Functions support only &lt;code&gt;"local"&lt;/code&gt; execution. Local code execution carries inherent risk. AWS recommends running this inside a Docker container or sandbox environment. Remote sandboxed execution is on the roadmap. Things to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run local execution in read-only filesystems&lt;/li&gt;
&lt;li&gt;Use network restrictions&lt;/li&gt;
&lt;li&gt;Strip secrets from runtime environment&lt;/li&gt;
&lt;li&gt;Treat AI-generated code as untrusted by default&lt;/li&gt;
&lt;li&gt;Add observability — because each agent step is explicit, failures and retries are inspectable, making it far easier to debug than ad-hoc prompt pipelines&lt;/li&gt;
&lt;/ul&gt;
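As one concrete example of that mindset, stripping secrets when shelling out to run generated code can be as simple as an allowlisted environment (a sketch; the prefix list is a made-up policy, not an AWS recommendation):

```python
import os

SAFE_PREFIXES = ("PATH", "LANG", "HOME")  # hypothetical allowlist

def restricted_env():
    """Copy of the environment with everything outside the allowlist stripped."""
    return {k: v for k, v in os.environ.items() if k.startswith(SAFE_PREFIXES)}

# e.g. subprocess.run(["python", "generated.py"], env=restricted_env())
print(sorted(restricted_env()))
```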




&lt;h2&gt;
  
  
  Async and Parallel Workflows
&lt;/h2&gt;

&lt;p&gt;AI Functions support async definitions natively, enabling parallel agentic workflows that dramatically reduce wall-clock time. In the stock report example from the official documentation, two research agents run concurrently using &lt;code&gt;asyncio.gather()&lt;/code&gt; before their results are combined into a final report — a pattern that maps perfectly onto real-world multi-step analysis pipelines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...])&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_news&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Research and summarize current news for: {stock}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...])&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;past_days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve 30-day historical prices for {stock} using yfinance.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stock_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Both agents run in parallel
&lt;/span&gt;    &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;research_news&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;research_price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;past_days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;write_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  AI Functions as Tools in Multi-Agent Systems
&lt;/h2&gt;

&lt;p&gt;Here's where it gets architecturally interesting: AI Functions can be registered as tools within other agents, both other AI Functions and regular Strands Agents. This creates a composable, hierarchical agent architecture where each layer does exactly what it's best at.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search the web and return a summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;websearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Research `{query}` online and return a summary of findings.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;websearch&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# websearch is now a tool for this agent
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;report_writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Research the following topic and write a report: {topic}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Also works with regular Strands Agents:
# agent = Agent(model=..., tools=[websearch])
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why This Matters for AWS Architects
&lt;/h2&gt;

&lt;p&gt;If you're working with Strands, this library slots in naturally. The default model is Claude on Bedrock, and you can swap in any Strands-supported model.&lt;/p&gt;

&lt;p&gt;For architects and platform engineers, the important takeaway is not just the library itself, but the &lt;strong&gt;pattern it represents&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Separation of intent from implementation&lt;/strong&gt; — declare what you need, not how to do it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic guardrails around non-deterministic AI&lt;/strong&gt; using post-conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composable, reusable AI components&lt;/strong&gt; — functions as first-class building blocks for agent graphs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Python support&lt;/strong&gt; — return real objects, not raw strings, maintaining type safety across your pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As the lines between traditional software engineering and AI engineering continue to blur, frameworks like Strands AI Functions point toward a near future where the "implementation" of a function and the "intent" of a function can finally be decoupled — the developer specifies the intent, the model fulfills it, and post-conditions enforce it.&lt;/p&gt;




&lt;h2&gt;
  
  
  When NOT to Use AI Functions
&lt;/h2&gt;

&lt;p&gt;These functions are inherently non-deterministic, which makes them a poor fit for certain scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Low-latency paths&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hard real-time requirements&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong deterministic compliance constraints&lt;/strong&gt; (e.g., finance)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, AI Functions can incur higher costs than deterministic pipelines due to repeated LLM invocations (retries, validation passes, and tool calls). Cost controls and limits are therefore essential.&lt;/p&gt;
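&lt;p&gt;As a minimal sketch of such a cost control (the names here are hypothetical and not part of Strands or any AWS API), a shared budget object can cap total model invocations across retries, validation passes, and tool calls:&lt;/p&gt;

```python
# Hypothetical budget guard (not part of Strands): caps the total number
# of model invocations a workflow may consume, counting retries and
# validation passes against the same budget.
class LLMBudgetExceeded(RuntimeError):
    pass

class LLMBudget:
    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0

    def charge(self) -> None:
        # Count one invocation; fail fast once the cap is reached.
        self.calls += 1
        if self.calls > self.max_calls:
            raise LLMBudgetExceeded(f"exceeded {self.max_calls} LLM calls")

budget = LLMBudget(max_calls=3)

def summarize(text: str) -> str:
    budget.charge()    # every invocation draws on the shared budget
    return text[:80]   # stand-in for a real model call

for _ in range(3):
    summarize("quarterly earnings recap")
# A fourth call would raise LLMBudgetExceeded.
```

&lt;p&gt;The same pattern extends naturally to dollar budgets or per-tenant quotas; the key point is that the limit is enforced deterministically, outside the model.&lt;/p&gt;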




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you want to experiment hands-on, getting started is refreshingly simple. The official documentation includes a clean QuickStart with the meeting summarizer example, and the &lt;a href="https://github.com/strands-labs" rel="noopener noreferrer"&gt;Strands Labs GitHub&lt;/a&gt; has complete examples for stock report generation, multi-agent orchestration, and context management for long-running tasks. The &lt;a href="https://aws.amazon.com/blogs/" rel="noopener noreferrer"&gt;AWS Blog release post&lt;/a&gt; also provides more details.&lt;/p&gt;

</description>
      <category>strandagents</category>
      <category>aws</category>
      <category>generativeai</category>
      <category>python</category>
    </item>
    <item>
      <title>My AWS Golden Jacket Playbook: Study Methods That Work for Working Parents</title>
      <dc:creator>Sauveer Ketan</dc:creator>
      <pubDate>Tue, 18 Nov 2025 11:49:02 +0000</pubDate>
      <link>https://dev.to/sauveer_ketan/my-aws-golden-jacket-playbook-study-methods-that-work-for-working-parents-18c</link>
      <guid>https://dev.to/sauveer_ketan/my-aws-golden-jacket-playbook-study-methods-that-work-for-working-parents-18c</guid>
      <description>&lt;p&gt;Picture this: Already tired from work (1PM — 11 PM), you also spent one hour to make your one-year-old go to sleep, it's 1:30 AM, bleary-eyed, you're studying various reasons behind overfitting in a ML model while tomorrow you have to give a demo to a client. This was the reality of my 2.5-month sprint to earning the AWS Golden Jacket.&lt;/p&gt;

&lt;p&gt;After 16+ years in IT and countless late-night experiences, I thought I had seen it all. But nothing quite prepared me for the whirlwind that would be my journey to earning the AWS Golden Jacket.&lt;/p&gt;

&lt;p&gt;The AWS Golden Jacket is a special acknowledgement of the talent demonstrated by AWS Partners. Earning it involves holding all active AWS certifications and submitting an application through your company's AWS alliance partner. In my case that meant 12 certifications; I completed the final one in Aug 2025.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F425mlqk0wclq08wnxtfe.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F425mlqk0wclq08wnxtfe.jpg" alt=" " width="800" height="655"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  The Starting Line: 6 Down, 6 to Go
&lt;/h2&gt;

&lt;p&gt;Let me be honest: I didn't start from scratch. As an AWS Solutions Architect with extensive AWS experience, I already had 6 AWS certifications under my belt, including both professional ones and the Security specialty. Apart from the machine learning certifications, I already had much of the knowledge the remaining exams required. The Golden Jacket wasn't even on my radar initially. I got a push from my organization's leadership and thought, why not! I just wanted to deepen my understanding of AWS services and get familiar with the ones I hadn't worked with extensively, such as those related to machine learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2.5-Month Marathon
&lt;/h2&gt;

&lt;p&gt;Between May 30th and August 14th, 2025, I tackled six certifications back-to-back. No firm deadline, just pure momentum. Six certifications in 2.5 months while maintaining a full-time job and being a father to a one-year-old. Let's just say my coffee consumption reached unprecedented levels, and sometimes sleep became a luxury.&lt;/p&gt;

&lt;p&gt;Here's what my typical day actually looked like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Morning to Evening:&lt;/strong&gt; Regular work hours (mix of office and work-from-home). Study in between, whenever I could.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Babysitting:&lt;/strong&gt; My one-year-old had other plans for my "focused study time." He'd knock on my home office door and demand attention whenever he felt like it. Whatever I planned — locking doors, scheduled study blocks — it didn't work. I had to spend multiple precious hours with my baby, and honestly, I wouldn't have it any other way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10:00–11:00 PM:&lt;/strong&gt; Finally finishing office work and daily responsibilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;11:00 PM — 1:00/2:00 AM:&lt;/strong&gt; My golden study window — when the house was quiet and I could focus. Those late-night hours became sacred. While my family slept, I'd dive into my study routine, knowing that these 2–3 hours were all I had before another demanding day began. On weekends, I got a few extra hours, if I was lucky.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My Method: One Source, Max Speed
&lt;/h2&gt;

&lt;p&gt;Let me share what actually worked for me. Relying on a few trusted study sources is key; I feel there is no need to follow multiple sources for the same material. I have followed similar methods for my previous certifications, and for learning anything in general. I also hold 10 Azure certifications and 2 Google Cloud certifications, including their topmost Solutions Architect certifications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batching Certs for Efficiency
&lt;/h3&gt;

&lt;p&gt;I grouped the certifications based on their content and decided the order in which I would take them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developer Associate + SysOps Administrator Associate&lt;/strong&gt; (similar operational concepts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Engineer Associate&lt;/strong&gt; (standalone but foundation for ML)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML Associate + ML Specialty&lt;/strong&gt; (complementary ML focus)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Networking Specialty&lt;/strong&gt; (the beast everyone fears)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Primary Learning Source
&lt;/h3&gt;

&lt;p&gt;Stephane Maarek's Udemy courses were my primary sources. His teaching style is perfect for working professionals: clear, comprehensive, and practical. The fact that he collaborates with experts like Frank Kane for machine learning, and Abhishek Singh and Chetan Agrawal for other domains, means you're getting top-tier instruction across all AWS services. For some tricky concepts I also used generative AI (ChatGPT, Gemini), for example the confusion matrix in Machine Learning or various BGP configurations in Advanced Networking Specialty. There are many other good courses and practice tests, from creators like Jon Bonso, Adrian Cantrill, and Neal Davis, but I prefer sticking with a single source.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speed Learning
&lt;/h3&gt;

&lt;p&gt;I watched most content at 1.5x to 2x speed, depending on my familiarity with the topic. New concepts got the full treatment at normal speed, but review material flew by at 2x. Time is precious when you only have 2–3 hours a night.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Console Walkthrough
&lt;/h3&gt;

&lt;p&gt;When time permitted, I'd dive into the AWS console to explore service configurations hands-on. For simpler services like App Runner, I'd actually create resources to understand the workflow. For complex, expensive services like Redshift or EMR clusters, I'd navigate through the creation process without deploying — just to familiarize myself with configuration options, pricing models, and architectural considerations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mind Mapping
&lt;/h3&gt;

&lt;p&gt;I love mind maps. I used the SimpleMind app on my phone to create detailed mind maps during study sessions. Being able to quickly capture concepts and relationships on my phone meant I could review them anywhere: during lunch breaks, while travelling, even while my son played around me.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A sample snapshot of one of my mind-maps - &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9iabwspc20rvfoich48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9iabwspc20rvfoich48.png" alt="AWS Networking" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice Tests
&lt;/h3&gt;

&lt;p&gt;If I had buffer time before an exam, I also took the Udemy practice tests from the same creators. They don't just test knowledge — they teach you how AWS wants you to think about problems. I would also update my mind maps while revising them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Review Strategy
&lt;/h3&gt;

&lt;p&gt;After finishing the course and tests, I'd abandon new content entirely and just review my SimpleMind maps. Having all the key concepts in visual format made last-minute reviews incredibly efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Toughest Moments (Spoiler: It Wasn't Advanced Networking)
&lt;/h2&gt;

&lt;p&gt;The Advanced Networking specialty is a tough one, but not as tough as it is made out to be. It is widely considered the hardest AWS exam, but I think that depends on your prior experience. The tricky part is hybrid networking if you don't have hands-on experience, but that can be learned through focused study. Diving deep into Direct Connect Gateway, VPC Lattice, and BGP routing at midnight was an exhilarating experience.&lt;/p&gt;

&lt;p&gt;For me the toughest exam experience was &lt;strong&gt;Data Engineer Associate&lt;/strong&gt;. I thought I was ready, but in the exam most questions felt similar, and it was often difficult to distinguish between the choices. That showed in my score: 78%, my lowest in any certification, whereas I have scored 80+ in all the others.&lt;/p&gt;

&lt;p&gt;The real challenge was the daily grind. Sometimes, out of pure frustration and exhaustion, I'd just watch a sitcom episode and call it a night. Some weeks were so hectic that I didn't study even on weekends — I just relaxed instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Learned (Beyond the Certs)
&lt;/h2&gt;

&lt;p&gt;The journey is more important than the result, and the real learning was deeper:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Humility:&lt;/strong&gt; Even with 16 years of experience, 7+ of them with AWS, there's always more to learn. AWS's breadth is genuinely mind-boggling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern Recognition:&lt;/strong&gt; Our brains are pattern recognition machines. You start seeing how AWS services interconnect in ways that aren't obvious from the outside. Everything truly is connected. Patterns and analogies are also a good way to learn and remember concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perspective Shifting:&lt;/strong&gt; Each specialty certification forced me to think like a different type of practitioner — a ML engineer, a security architect, a network engineer. This perspective shift was invaluable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time Management:&lt;/strong&gt; Squeezing study time in between everything else, learning to maximize those precious late-night hours taught me efficiency.&lt;/p&gt;

&lt;p&gt;The real reward isn't the jacket itself — it's the confidence to walk into any AWS conversation knowing I can contribute meaningfully rather than being completely clueless (I mainly mean Machine Learning!).&lt;/p&gt;

&lt;h2&gt;
  
  
  For Those Considering the Journey
&lt;/h2&gt;

&lt;p&gt;I wouldn't recommend this sprint approach to anyone; instead, do it gradually by making a plan.&lt;/p&gt;

&lt;p&gt;My advice: take the key certifications that align with your work area first. Certifications alone open a few doors, but in interviews you need real knowledge, and real knowledge comes through practice and hands-on experience. If your job doesn't provide this, invest in a personal AWS account for practical learning.&lt;/p&gt;

&lt;p&gt;Even if you're not planning certifications, study the materials and take practice tests to enhance your knowledge. The structured learning path is valuable regardless.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;The Golden Jacket is just a trophy. AWS releases new services constantly, and existing services evolve rapidly. The real challenge is staying current and applying this knowledge to solve real-world problems.&lt;/p&gt;

&lt;p&gt;I keep my eyes on AWS updates constantly because in cloud computing — and the IT industry generally — the moment you stop learning is when you start falling behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Sixteen years in IT taught me that technology changes, but the fundamentals of good architecture remain constant: understand your requirements, choose the right tools, and never stop questioning your assumptions.&lt;/p&gt;

&lt;p&gt;The AWS Golden Jacket represents technical breadth, but more importantly, it represents a commitment to continuous learning. In a field that evolves as rapidly as cloud computing, that commitment matters more than any certification badge.&lt;/p&gt;

&lt;p&gt;To my fellow cloud architects out there — especially those juggling career advancement with family life — whether you're pursuing your first AWS cert or your twelfth, remember that the journey is just as valuable as the destination. Every late-night study session, every failed practice exam, and every "aha!" moment at 1 AM contributes to making you a better architect.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you pursued AWS certifications while balancing work and family? What strategies worked for you? Share your experience in the comments!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>career</category>
      <category>studytips</category>
      <category>certifications</category>
    </item>
    <item>
      <title>Building Better on AWS: A Practical Guide to the Well-Architected Framework</title>
      <dc:creator>Sauveer Ketan</dc:creator>
      <pubDate>Mon, 17 Nov 2025 09:10:41 +0000</pubDate>
      <link>https://dev.to/sauveer_ketan/building-better-on-aws-a-practical-guide-to-the-well-architected-framework-1l2p</link>
      <guid>https://dev.to/sauveer_ketan/building-better-on-aws-a-practical-guide-to-the-well-architected-framework-1l2p</guid>
      <description>&lt;p&gt;AWS is huge, hundreds of services and thousands of features. So many services, so many possibilities, and hence, so many ways to mess things up.&lt;/p&gt;

&lt;p&gt;A misconfigured S3 bucket. That's all it took for Capital One to lose 100 million customer records in 2019. What if someone had asked the right questions about their architecture?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if there were a free tool to systematically review your entire environment and workloads against AWS best practices? To top it off, what if AWS paid you to fix the risks it finds?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How do we build better on AWS? That's where the AWS Well-Architected Framework comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Is This Framework?
&lt;/h2&gt;

&lt;p&gt;Think of the AWS Well-Architected Framework as your architectural guardrails — a set of battle-tested best practices that AWS has compiled from working with thousands of customers. It's not some rigid rulebook that you have to follow to the letter. Instead, it's more like having a seasoned architect sitting next to you, asking the right questions about your workloads and environment and giving you feedback and an action plan.&lt;/p&gt;

&lt;p&gt;The framework is built on six pillars, and each one addresses a crucial aspect of your cloud architecture. Each pillar contains design principles, questions, and best practices: design principles are high-level guidelines, while best practices are concrete recommendations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Excellence&lt;/strong&gt; is about deploying, running and monitoring systems to deliver business value. The operational excellence pillar contains best practices for organizing your team, deploying your workload, operating it at scale, and evolving it over time. You will see DevOps and ITIL processes here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt; covers protecting data, systems, and environments. This isn't just about checking compliance boxes — it's about actually understanding your entire security spectrum, both preventive and detective. In one recent case, a Palo Alto EC2 server went down and no one knew why: AWS had sent a health notification to stop and start it because of degraded hardware, but no one was receiving those emails. They were going to a single person who had left the organization!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability&lt;/strong&gt; ensures your workload performs its intended function correctly and consistently, including the ability to operate and test the workload, with resilience, throughout its total lifecycle. The &lt;strong&gt;Oct 2025 AWS outage&lt;/strong&gt; must have made everyone realize the importance of baking reliability into critical workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Efficiency&lt;/strong&gt; is the ability to use cloud resources efficiently to meet performance requirements, and to maintain that efficiency as demand changes and technologies evolve. A normally performant system can hit bottlenecks during peak demand periods if capacity is not planned properly. Pre-warming, for example, is one way to handle this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Optimization&lt;/strong&gt; means avoiding unnecessary costs. This pillar has saved organizations thousands of dollars. We have found EC2 instances from a proof-of-concept that had been running, forgotten, for eight months. We have found DMS instances still running two years after the migration completed. We have found thousands of unattached EBS volumes and unnecessary snapshots. These are just a few examples. As the current wisdom goes, while architecting workloads, &lt;strong&gt;cost should always be treated as a non-functional requirement.&lt;/strong&gt;&lt;/p&gt;
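&lt;p&gt;Finding those unattached EBS volumes is easy to script. A minimal sketch of the filtering logic (in practice you would feed it the output of the EC2 DescribeVolumes API, where an unattached volume reports the state &lt;code&gt;available&lt;/code&gt;):&lt;/p&gt;

```python
# Sketch: pick out unattached EBS volumes from a DescribeVolumes-shaped
# response. Unattached volumes report the state 'available'.
def find_unattached_volumes(response: dict) -> list:
    return [vol["VolumeId"]
            for vol in response.get("Volumes", [])
            if vol.get("State") == "available"]

# Sample payload mimicking a DescribeVolumes response:
sample = {"Volumes": [
    {"VolumeId": "vol-0a1", "State": "in-use"},
    {"VolumeId": "vol-0b2", "State": "available"},
    {"VolumeId": "vol-0c3", "State": "available"},
]}
print(find_unattached_volumes(sample))  # ['vol-0b2', 'vol-0c3']
```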

&lt;p&gt;The &lt;strong&gt;Sustainability&lt;/strong&gt; pillar focuses on minimizing the environmental impact of your cloud workloads, especially energy consumption and efficiency. It's about being smart with resource usage — not just for the planet, but for your wallet too. &lt;strong&gt;For example,&lt;/strong&gt; AWS Graviton-based Amazon EC2 instances use up to 60% less energy than comparable EC2 instances for the same performance, and they provide the best price performance for cloud workloads running on Amazon EC2. Over 70,000 customers have used AWS Graviton to build efficient and performant workloads as of 2025.&lt;/p&gt;

&lt;p&gt;AWS has provided an excellent mind-map for this, where different entities are clickable and lead to relevant documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5dd3viqdr1hn0fxlb6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5dd3viqdr1hn0fxlb6b.png" alt=" " width="790" height="782"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check it at &lt;a href="https://wa.aws.amazon.com/wat.map.en.html" rel="noopener noreferrer"&gt;Map of the AWS Well-Architected Framework&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter the AWS Well-Architected Tool
&lt;/h2&gt;

&lt;p&gt;Now, you might be thinking, "This all sounds great, but how do I actually apply this to my architecture?" There are 6 pillars, 57 different questions, and multiple best practices under each question. That's where the AWS Well-Architected Tool (WA Tool) becomes your best friend.&lt;/p&gt;

&lt;p&gt;The WA Tool is a free service available in the AWS console that helps you review your workloads against these pillars. It's basically an interactive questionnaire that walks you through each pillar, asking you specific questions about your architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74ce1zsohzi50ucd6hvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74ce1zsohzi50ucd6hvz.png" alt=" " width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Actually Works
&lt;/h3&gt;

&lt;p&gt;Here's what the experience looks like in practice:&lt;/p&gt;

&lt;p&gt;You start by defining a workload. This could be anything — a microservice, an entire application, or even a data pipeline. The tool then presents you with a series of questions for each pillar. These aren't yes/no questions; they're thoughtful, sometimes challenging questions that make you really think about your design decisions.&lt;/p&gt;
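&lt;p&gt;Workload definition can also be scripted for repeatability. A hedged sketch of the request you might build, assuming boto3's &lt;code&gt;wellarchitected&lt;/code&gt; client (the workload name and regions below are made up, and the actual call is commented out since it needs AWS credentials):&lt;/p&gt;

```python
# Sketch: parameters for defining a workload programmatically.
# The names are hypothetical; the commented-out call assumes boto3's
# 'wellarchitected' client and valid AWS credentials.
workload = {
    "WorkloadName": "orders-service",
    "Description": "Order-processing microservice under review",
    "Environment": "PRODUCTION",          # or PREPRODUCTION
    "Lenses": ["wellarchitected"],        # the default WAF lens
    "AwsRegions": ["us-east-1"],
}

# import boto3
# client = boto3.client("wellarchitected")
# response = client.create_workload(**workload)
print(sorted(workload))
```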

&lt;p&gt;&lt;strong&gt;For example,&lt;/strong&gt; under Security, you might get asked: "How do you protect your network resources?" The tool then offers multiple choice answers based on best practices, and you select what applies to your workload.&lt;/p&gt;

&lt;p&gt;What I really appreciate is that for each question, there's context. The tool explains why the question matters and what the implications are of different approaches. It's educational.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8z8yemdqnsky3puzr74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8z8yemdqnsky3puzr74.png" alt=" " width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Features of WA Tool
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Consolidated Reports:&lt;/strong&gt; Not only for your individual workloads, but if you're managing multiple workloads, you can generate reports across all of them. This is invaluable for getting an organizational view of your cloud architecture health.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Risk Identification:&lt;/strong&gt; After you complete the review, you can generate a report that shows high-risk issues (HRIs) and medium-risk issues (MRIs). These aren't generic warnings — they're specific to what you told the tool about your architecture. The dashboard also provides a visualization across all workloads. Seeing those red flags visualized really helps prioritize what to tackle first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuby7xo73sgbv4vemiisx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuby7xo73sgbv4vemiisx.png" alt=" " width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improvement Plans:&lt;/strong&gt; The tool doesn't just point out problems; it suggests remediation steps. Each identified risk comes with links to documentation, whitepapers, and specific AWS services that can help address the issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22fl17mejwg18y1zil5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22fl17mejwg18y1zil5x.png" alt=" " width="706" height="787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Milestones:&lt;/strong&gt; You can save snapshots of your reviews over time. This is fantastic for tracking improvements. For example, you can run quarterly reviews and show executives how you've systematically reduced high-risk items from 12 to 2 over the last quarter. That makes the investment in improvements really tangible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lenses:&lt;/strong&gt; Beyond the standard six pillars included in the default WAF lens, AWS offers specialized lenses for specific use cases. There's a DevOps Lens, a Serverless Lens, a SaaS Lens, and several others. There are also industry-specific lenses, such as Healthcare and Financial Services, which are very helpful given the compliance requirements of those industries. And of course there is now a Generative AI lens, covering the industry's hottest topic. Each lens includes questions and best practices relevant to its area.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom Lenses:&lt;/strong&gt; If your organization has specific standards or requirements, you can create custom lenses. For example, enterprises can use this to encode their security policies or compliance requirements directly into the review process. These custom lenses can also be shared with other accounts or your entire AWS organization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review Templates:&lt;/strong&gt; These help with standardization. You can create review templates in the WA Tool that contain pre-filled answers for Well-Architected Framework and custom-lens best practice questions. Templates reduce the need to manually fill in the same answers for best practices that are common across multiple workloads, and they help drive consistency and standardization across teams. A template can hold common answers and notes, can be shared with another IAM user or account, or with an organization or organizational unit in the same AWS Region, and you can define a workload directly from a template to scale common best practices and reduce redundancy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profiles:&lt;/strong&gt; You can create profiles to capture your business context and the goals you'd like to accomplish in a Well-Architected review. The WA Tool uses this information to focus your review on a prioritized list of questions relevant to your business. Attaching a profile to a workload also shows which risks you should prioritize in your improvement plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Funding:&lt;/strong&gt; AWS pays you to fix your issues! As of this writing, businesses can receive $5,000 in AWS credits to offset the cost of remediating issues identified during an AWS Well-Architected Framework (WAF) review. To qualify, you must partner with a certified Well-Architected Partner to conduct the review. Check with your AWS partner or AWS TAM on this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Insights
&lt;/h2&gt;

&lt;p&gt;Let me share some practical wisdom from actually using this framework:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start Small:&lt;/strong&gt; Don't try to review your entire infrastructure in one sitting. Pick one critical workload and go through the exercise thoroughly. Start with a non-prod environment to get the hang of it. Maybe your newest project, where you can actually implement changes quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The First Review Is Always Humbling:&lt;/strong&gt; Even systems designed by experienced architects will have gaps. That's okay — that's the point. The framework represents the collective wisdom of thousands of AWS architects. It's supposed to teach you something. Also, the cloud is always evolving and bringing in better ways to do things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make It a Team Activity:&lt;/strong&gt; Running through the questions with your team is way more valuable than having one person fill it out alone. For example, the discussion around "How do you test reliability?" might reveal that the developers thought they had comprehensive testing, but ops was manually verifying deployments. This insight alone can prevent a future incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High-Risk Doesn't Always Mean "Drop Everything":&lt;/strong&gt; Context matters. For example, the tool can flag a development environment for not having multi-region failover. Technically a risk, but for a dev environment? Not worth the complexity. Use your judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 80/20 Rule Applies:&lt;/strong&gt; The Pareto principle comes into play here, just like almost everywhere else: about 20% of the recommendations typically address 80% of your actual risk. Focus on the high-risk items first, especially around security and reliability. You can optimize costs and performance iteratively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revisit Regularly:&lt;/strong&gt; Your architecture isn't static, and your Well-Architected reviews shouldn't be either. Quarterly reviews are recommended for production workloads. New features get added, traffic patterns change, and AWS releases new services that might better address your needs. For example, security groups can now be shared across VPCs and accounts, making centralized management possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use It for New Projects:&lt;/strong&gt; Here's a pro tip: run through the relevant questions before you build something new. Use the Well-Architected questions as a checklist during design phases. It's far easier to build security in from the start than to retrofit it later.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example
&lt;/h2&gt;

&lt;p&gt;Let me tell you about a project from some time back. A client had migrated one of their data centers to the cloud three years earlier, but there was no proper cloud team and very little was properly configured. For example, they did not even have default EBS encryption enabled, which can be done in seconds at the account level. Their monthly AWS bill was creeping up, and they couldn't figure out why. I was engaged mainly for cost optimization, with secondary emphasis on everything else.&lt;/p&gt;

&lt;p&gt;We ran a Well-Architected review focusing heavily on the Cost Optimization pillar first. The review revealed several issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple EC2 instances that had been sitting in a stopped state for months or years.&lt;/li&gt;
&lt;li&gt;Their RDS instances were over-provisioned for average load.&lt;/li&gt;
&lt;li&gt;Hundreds of unattached EBS volumes.&lt;/li&gt;
&lt;li&gt;Backups retained forever (20 TB of snapshots), despite no compliance requirement to keep them.&lt;/li&gt;
&lt;/ul&gt;
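&lt;p&gt;Findings like the unattached volumes are easy to hunt down programmatically. Here's a minimal sketch (assuming &lt;code&gt;boto3&lt;/code&gt; is installed and AWS credentials are configured; the function names are illustrative, not from the actual engagement):&lt;/p&gt;

```python
# Sketch: find unattached EBS volumes. An EBS volume whose State is
# "available" is not attached to any instance.

def unattached_volume_ids(volumes):
    """Return the IDs of volumes that are not attached to any instance."""
    return [v["VolumeId"] for v in volumes if v.get("State") == "available"]

def find_unattached_volumes(region="us-east-1"):
    """Fetch all volumes in a region and filter to the unattached ones.

    Requires boto3 and credentials with ec2:DescribeVolumes.
    """
    import boto3
    ec2 = boto3.client("ec2", region_name=region)
    volumes = []
    for page in ec2.get_paginator("describe_volumes").paginate():
        volumes.extend(page["Volumes"])
    return unattached_volume_ids(volumes)
```

&lt;p&gt;The same pattern works for the long-stopped instances: filter &lt;code&gt;describe_instances&lt;/code&gt; results on the &lt;code&gt;stopped&lt;/code&gt; state and check how long they have been that way.&lt;/p&gt;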

&lt;p&gt;The Cost Optimization pillar helped us identify these issues systematically. Within two months, we had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decommissioned unused servers&lt;/li&gt;
&lt;li&gt;Right-sized EC2 and RDS instances&lt;/li&gt;
&lt;li&gt;Deleted unattached EBS volumes&lt;/li&gt;
&lt;li&gt;Modified backup retention policy to 3 weeks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result? Their monthly bill showed good improvement. More importantly, systems and a regular rhythm were put in place. During this process, we enabled Compute Optimizer, Trusted Advisor, and Cost Optimization Hub, and set up budget alerts and cost anomaly detection alerts. Teams started receiving notifications, gaining visibility into their bills. A bi-weekly call was set up to review the findings and assign action items.&lt;/p&gt;

&lt;p&gt;After this, we did a review of the other pillars and created a comprehensive action plan. The pillars were customized so that instead of individual workloads, we focused on the entire landing zone. Now they have a well-architected landing zone, a lower bill, and fewer incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started Today
&lt;/h2&gt;

&lt;p&gt;If you want to try this out, here's what you should do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log into the AWS Console and search for "Well-Architected Tool"&lt;/li&gt;
&lt;li&gt;Click "Define workload" and pick something meaningful but manageable&lt;/li&gt;
&lt;li&gt;Set aside 2–3 hours with your team&lt;/li&gt;
&lt;li&gt;Go through one or two pillars thoroughly rather than rushing through all six&lt;/li&gt;
&lt;li&gt;Focus on understanding the "why" behind each question&lt;/li&gt;
&lt;li&gt;Generate your report and prioritize the high-risk items&lt;/li&gt;
&lt;li&gt;Create tickets or action items for addressing the gaps&lt;/li&gt;
&lt;li&gt;Schedule your next review in 3–6 months&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The framework isn't magic — it won't automatically fix your architecture. But it will give you a systematic way to think about your systems, identify blind spots, and continuously improve. And that's exactly what separates good cloud architectures from great ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources for Deep Dive
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html" rel="noopener noreferrer"&gt;AWS Well-Architected Framework&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Well Architected Lab:&lt;/strong&gt; &lt;a href="https://wellarchitectedlabs.com/" rel="noopener noreferrer"&gt;AWS Well-Architected Labs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Well-Architected Tool Workshop:&lt;/strong&gt; &lt;a href="https://catalog.workshops.aws/well-architected-tool/en-US" rel="noopener noreferrer"&gt;Workshop&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Have you used the Well-Architected Framework? Have you found any surprising issues in your WA reviews? I'd love to hear about it in the comments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudcomputing</category>
      <category>architecture</category>
      <category>devops</category>
    </item>
    <item>
      <title>Agentic AI — Automating AWS Tasks with Amazon Bedrock Agents and a Custom Web App</title>
      <dc:creator>Sauveer Ketan</dc:creator>
      <pubDate>Thu, 18 Sep 2025 19:32:22 +0000</pubDate>
      <link>https://dev.to/sauveer_ketan/agentic-ai-automating-aws-tasks-with-amazon-bedrock-agents-and-a-custom-web-app-3e56</link>
      <guid>https://dev.to/sauveer_ketan/agentic-ai-automating-aws-tasks-with-amazon-bedrock-agents-and-a-custom-web-app-3e56</guid>
      <description>&lt;p&gt;Amazon Bedrock Agents are conversational AI applications that extend the capabilities of Large Language Models (LLMs) by enabling them to understand multi-step user requests and orchestrate actions across different systems. They act as intelligent intermediaries, breaking down complex tasks, reasoning about the required steps, maintaining conversation context, and using defined tools to interact with external services to fulfill user goals.&lt;/p&gt;

&lt;p&gt;You can leverage Bedrock Agents extensively for AWS operations by integrating them with AWS services via Action Groups linked to tools like Lambda functions or APIs. This allows agents to perform automated resource management tasks, such as listing EC2 instances, enumerating S3 buckets, or creating new servers based on natural language commands. They can also retrieve operational information for troubleshooting or provide a simplified, conversational interface for users to interact with and manage AWS resources without needing deep technical expertise or direct console access.&lt;/p&gt;

&lt;p&gt;An action group defines actions that the agent can help the user perform, for example listing EC2 servers. You define the parameters and information that the agent must elicit from the user for each action in the action group to be carried out. You also decide how the agent handles the parameters and information it receives from the user, and where it sends the information it elicits. Knowledge bases can optionally be configured as well.&lt;/p&gt;

&lt;p&gt;This article will guide you through setting up a Bedrock Agent with Action Groups and then building a custom web application around it. We will create and integrate AWS Lambda functions that allow the agent to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;List existing EC2 instances&lt;/li&gt;
&lt;li&gt;List existing S3 buckets
&lt;/li&gt;
&lt;li&gt;Create a new EC2 instance with specified parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This demonstrates how you can leverage large language models (LLMs) orchestrated by Bedrock Agents to perform actions in your AWS account based on natural language commands, and then provide a user-friendly interface for those actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To follow along with this tutorial, you will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Account&lt;/strong&gt;: An active AWS account with necessary permissions to create and manage Bedrock Agents, IAM roles, Lambda functions, EC2 instances, and view S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Bedrock Access&lt;/strong&gt;: Ensure Amazon Bedrock is enabled in your account and the AWS region you plan to use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bedrock Model Access&lt;/strong&gt;: You need access to a supported model for Bedrock Agents, such as Anthropic Claude. Follow the steps in the Bedrock console under "Model access" to enable this&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Enable Bedrock Model Access
&lt;/h2&gt;

&lt;p&gt;Under the Amazon Bedrock console, go to "Model access" (bottom left). Click "Modify access," select the Anthropic Claude 3.5 Sonnet models, and submit. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: If you are in India, your payment mode should be invoice-based, not card-based, to avoid errors about an invalid payment mode.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After a few minutes, you will get access, and these models will become active in your console.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Create AWS Lambda Functions
&lt;/h2&gt;

&lt;p&gt;We will create three Lambda functions. Each function performs a specific AWS task and returns the result in a format Bedrock Agents expect. Deploy these functions in the same AWS region where you configure your Bedrock Agent.&lt;/p&gt;

&lt;p&gt;You can find the complete Python code for these functions in the following GitHub repository:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/sauveerk/projects/tree/main/Code/Gen-AI-Bedrock-Agents" rel="noopener noreferrer"&gt;https://github.com/sauveerk/projects/tree/main/Code/Gen-AI-Bedrock-Agents&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Lambda Function 1: List EC2 Instances
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;This function, &lt;code&gt;action_group_ec2&lt;/code&gt;, lists the IDs of all EC2 instances in the current region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Permissions Needed&lt;/strong&gt;: &lt;code&gt;ec2:DescribeInstances&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
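&lt;p&gt;For action groups defined with function details, Bedrock Agents expect the Lambda to return a particular response envelope. The actual function code is in the GitHub repo linked above; the sketch below is an illustrative version of the list-EC2 handler, with the envelope shape following the Bedrock Agents Lambda interface:&lt;/p&gt;

```python
# Illustrative sketch of a Bedrock Agents action-group Lambda (function
# details style). The real code is in the GitHub repo linked above.

def format_agent_response(event, body_text):
    """Wrap plain text in the response envelope Bedrock Agents expect."""
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": body_text}}
            },
        },
    }

def lambda_handler(event, context):
    """List EC2 instance IDs in the current region for the agent."""
    import boto3  # the execution role needs ec2:DescribeInstances
    ec2 = boto3.client("ec2")
    ids = [
        inst["InstanceId"]
        for res in ec2.describe_instances()["Reservations"]
        for inst in res["Instances"]
    ]
    return format_agent_response(event, "EC2 instances: " + (", ".join(ids) or "none"))
```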
&lt;h3&gt;
  
  
  Lambda Function 2: List S3 Buckets
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;This function, &lt;code&gt;action_group_s3&lt;/code&gt;, lists the S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Permissions Needed&lt;/strong&gt;: &lt;code&gt;s3:ListAllMyBuckets&lt;/code&gt; (the IAM action behind the &lt;code&gt;ListBuckets&lt;/code&gt; API call)
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Lambda Function 3: Create EC2 Instance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;This function, &lt;code&gt;action_group_create_ec2&lt;/code&gt;, creates an EC2 server. It takes the instance type and region as parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Permissions Needed&lt;/strong&gt;: &lt;code&gt;ec2:RunInstances&lt;/code&gt;, &lt;code&gt;ec2:DescribeInstances&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
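&lt;p&gt;For this function, the agent passes the values it elicits from the user as a list of name/type/value entries under &lt;code&gt;parameters&lt;/code&gt; in the event. The full code is in the repo; here is a small sketch of the extraction step (the parameter names are examples):&lt;/p&gt;

```python
# Bedrock Agents pass elicited parameters as a list of
# {"name": ..., "type": ..., "value": ...} dicts in event["parameters"].

def get_parameters(event):
    """Convert the agent's parameter list into a name -> value dict."""
    return {p["name"]: p["value"] for p in event.get("parameters", [])}

# Usage inside the handler (parameter names are whatever you defined
# in the action group; these are examples):
#   params = get_parameters(event)
#   instance_type = params.get("instance_type", "t2.micro")
```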
&lt;h2&gt;
  
  
  Step 3: Create Bedrock Agent and Action Groups
&lt;/h2&gt;

&lt;p&gt;In the Bedrock console, navigate to "Agents" under "Builder Tools." Click "Create agent." Give your agent a name and a description (e.g., "AWS Resource Manager Agent").&lt;/p&gt;

&lt;p&gt;After the agent is created, click on it and select "Edit" in the Agent Builder.&lt;/p&gt;

&lt;p&gt;Select "Create and use a new service role" and define instructions for the agent. This prompt guides the agent's behavior. Something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an AI assistant that can help users manage their AWS resources. You can list EC2 instances, list S3 buckets, and create new EC2 instances.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the "Action Groups" section, click "Add." Give the action group a name (e.g., &lt;code&gt;Ec2ManagementActions&lt;/code&gt;). Provide a description (e.g., "Actions for managing EC2 resources"). Under "Action group invocation," select "Define with Function details."&lt;/p&gt;

&lt;p&gt;Select "Quick create a new Lambda function." This will create a Lambda function and add a resource-based policy to it so that it can be invoked by the Bedrock agent. Provide the function name and description. No parameters are required for this function as it only lists EC2 instances.&lt;/p&gt;

&lt;p&gt;You can select "Enable confirmation of action group function" here. The agent will ask for confirmation before proceeding with the invocation. It is disabled by default. Click save to create the action group.&lt;/p&gt;

&lt;p&gt;Click on the action group again. You can see the newly created Lambda function. Click "View," which will take you to the Lambda console. Copy the code from the GitHub repo given above and paste it into the function, then deploy it.&lt;/p&gt;

&lt;p&gt;Go to the Lambda function configuration and increase the timeout duration. Then go to the "Permissions" section and add the additional permissions needed to the role used by the Lambda function. In this case, it is &lt;code&gt;ec2:DescribeInstances&lt;/code&gt;.&lt;/p&gt;
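&lt;p&gt;The extra permission can be attached as a small inline policy on the function's execution role, for example:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
```

&lt;p&gt;&lt;code&gt;ec2:DescribeInstances&lt;/code&gt; does not support resource-level restrictions, so the resource must be &lt;code&gt;*&lt;/code&gt;.&lt;/p&gt;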

&lt;p&gt;Now the Lambda function is ready. You can test it by giving a test event in the appropriate format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messageVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bedrock-agent-list-ec2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"inputText"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sessionId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"888900372248953"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent-aws-resources"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DRAFT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VGGUJUPSEC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"alias"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TSTALIASID"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actionGroup"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"action_group_ec2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sessionAttributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"promptSessionAttributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow similar steps to add the other two action groups. For the EC2 create function, remember to define its parameters as well.&lt;/p&gt;

&lt;p&gt;Save the agent in the agent builder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Test the Bedrock Agent
&lt;/h2&gt;

&lt;p&gt;Once your agent is created, you can test its ability to invoke your Lambda functions from the Bedrock console. On the agent details page, click the "Prepare" button. This compiles the agent configuration. Wait for the status to show "Prepared." Re-run Prepare whenever you change the agent's configuration.&lt;/p&gt;

&lt;p&gt;Stay on the agent details page and use the "Test" panel on the right, for example by asking the agent to list your EC2 instances and S3 buckets. Because we enabled confirmation for these action groups, the agent decides which action groups to use and then asks us for confirmation. Confirm these.&lt;/p&gt;

&lt;p&gt;After invoking these two action groups and corresponding Lambda functions, the agent returns the response.&lt;/p&gt;

&lt;p&gt;If you click on "Show trace," you can see the thought process and steps used by the agent. It shows two trace steps, which you can expand to see the details.&lt;/p&gt;

&lt;p&gt;Let's take it one step further. Prompt to create an EC2 instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create a t2.nano EC2 instance in ap-south-1 region.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the AWS console, you can also see the same instance ID and IP address.&lt;/p&gt;

&lt;p&gt;In the trace, we can see how the agent has identified the correct action group and invoked it.&lt;/p&gt;

&lt;p&gt;As I had provided all required parameters in the prompt, it was able to directly invoke the third Lambda. Otherwise, it will ask for the missing parameters. Let's try that by giving the prompt: "create an ec2 server."&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2 — Agentic AI Web App for End Users
&lt;/h2&gt;

&lt;p&gt;The article above outlined how to create a powerful agent that can manage AWS resources using natural language. To put it to regular use, for example by operations teams, we can build a web app around it with a friendly interface. This lets users interact with the agent without needing access to the Bedrock console.&lt;/p&gt;

&lt;p&gt;In this section, we will create a simple web app around our agent. The full app code is present here, and it can be cloned:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/sauveerk/projects/tree/main/Code/WebApp-Bedrock-Agent" rel="noopener noreferrer"&gt;https://github.com/sauveerk/projects/tree/main/Code/WebApp-Bedrock-Agent&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Modify Action Group Behavior and Create Agent Alias
&lt;/h3&gt;

&lt;p&gt;We had enabled confirmation to test the behavior of our agent by selecting "Enable confirmation of action group function" for the action groups that list S3 buckets and EC2 servers. Let's disable it to simplify our web app code.&lt;/p&gt;

&lt;p&gt;Go to the action group that lists EC2 servers. Select the "Disabled" option and click save. Verify that confirmation is also disabled for the other action groups. Save the agent and prepare it.&lt;/p&gt;

&lt;p&gt;To use the agent programmatically, we need to create an alias. On the agent page, under the "Alias" section, click "Create." Give it a name and keep the default choice — "Create a new version and associate it to this alias." Click on "Create alias."&lt;/p&gt;

&lt;p&gt;Click on the newly created alias, and it will open the alias page. Use the "Test" button to verify that you are not being asked for confirmation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Python Program
&lt;/h3&gt;

&lt;p&gt;This is the Python code with which we can invoke our agent programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke_bedrock_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Invoke a Bedrock agent with the given prompt.

    Args:
        prompt (str): The prompt to send to the agent

    Returns:
        Optional[Dict]: The agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response or None if an error occurs
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bedrock_agent_runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock-agent-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error creating Bedrock client: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Prompt must be a non-empty string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_agent_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;agentId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;YOUR_AGENT_ID&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;agentAliasId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;YOUR_AGENT_ALIAS_ID&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;sessionId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;inputText&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Handle the event stream response
&lt;/span&gt;        &lt;span class="n"&gt;full_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;completion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;chunk_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;full_response&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;chunk_data&lt;/span&gt;

        &lt;span class="c1"&gt;# Parse the complete response if needed
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# If response is not JSON, return as plain text
&lt;/span&gt;                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Empty response from Bedrock Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ClientError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS API Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unexpected error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter your prompt: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;invoke_bedrock_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent Response:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to get response from agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's test the program. We can see that it responds correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Create Python Flask Web App
&lt;/h3&gt;

&lt;p&gt;Use this project structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;your_project_directory/
├── app.py
├── agent_invoke.py
└── templates/
    └── index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the code for &lt;code&gt;app.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_invoke&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;invoke_bedrock_agent&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_response_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;response_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

        &lt;span class="c1"&gt;# Check for 'response' field in the dictionary
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Check for 'completion' field as fallback
&lt;/span&gt;        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;completion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;completion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Return the full response if no known fields are found
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error processing response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;home&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;raw_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;invoke_bedrock_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_response_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;index.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the code for &lt;code&gt;index.html&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&lt;/span&gt; &lt;span class="na"&gt;lang=&lt;/span&gt;&lt;span class="s"&gt;"en"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;head&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;charset=&lt;/span&gt;&lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"viewport"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"width=device-width, initial-scale=1.0"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Bedrock Agent Interface&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;style&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;body&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nl"&gt;font-family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Arial&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;sans-serif&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;max-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;800px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="nb"&gt;auto&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nc"&gt;.container&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nl"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;flex&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;flex-direction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="py"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nt"&gt;textarea&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nl"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100%&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;min-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10px&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nt"&gt;button&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10px&lt;/span&gt; &lt;span class="m"&gt;20px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;background-color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#007bff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="no"&gt;white&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;border&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;none&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;border-radius&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;pointer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nt"&gt;button&lt;/span&gt;&lt;span class="nd"&gt;:hover&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nl"&gt;background-color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#0056b3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nc"&gt;.response-box&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nl"&gt;border&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1px&lt;/span&gt; &lt;span class="nb"&gt;solid&lt;/span&gt; &lt;span class="m"&gt;#ccc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;min-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nl"&gt;white-space&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pre-wrap&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/style&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"container"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;Bedrock Agent Interface&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;

        &lt;span class="nt"&gt;&amp;lt;form&lt;/span&gt; &lt;span class="na"&gt;method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;label&lt;/span&gt; &lt;span class="na"&gt;for=&lt;/span&gt;&lt;span class="s"&gt;"prompt"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Enter your prompt:&lt;span class="nt"&gt;&amp;lt;/label&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;textarea&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"prompt"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"prompt"&lt;/span&gt; &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;{{ request.form.get('prompt', '') }}&lt;span class="nt"&gt;&amp;lt;/textarea&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"submit"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Submit&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/form&amp;gt;&lt;/span&gt;

        {% if response %}
        &lt;span class="nt"&gt;&amp;lt;div&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;h2&amp;gt;&lt;/span&gt;Agent Response:&lt;span class="nt"&gt;&amp;lt;/h2&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"response-box"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;{{ response }}&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
        {% endif %}
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Test the Web App
&lt;/h3&gt;

&lt;p&gt;Go to the project directory and run the program using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start a local server. You can access the app on &lt;code&gt;localhost:5000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's try the prompt "How many EC2 servers do I have?" and press Submit. The response appears on the web page.&lt;/p&gt;

&lt;p&gt;Now try another prompt: "Create an EC2 server of type t2.micro in the ap-south-1 region."&lt;/p&gt;

&lt;p&gt;We can see that the server has been created. Let's verify it in the AWS console.&lt;/p&gt;

&lt;p&gt;If you ask the first prompt again, the agent now reports two servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Clean Up
&lt;/h3&gt;

&lt;p&gt;Remember to clean up the resources you created to avoid incurring unnecessary costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminate any EC2 instances created during testing&lt;/li&gt;
&lt;li&gt;Delete the Bedrock Agent&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 6: Additional Considerations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prompt Quality&lt;/strong&gt;: The quality of your prompt significantly impacts the agent's ability to correctly understand user intent and use the right tools. Be clear and specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throttling&lt;/strong&gt;: Because service quotas apply to the underlying LLMs, you may hit throttling errors. Check the relevant service quotas, then wait and retry, ideally with exponential backoff.&lt;/p&gt;
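&lt;p&gt;A retry loop with exponential backoff is the usual remedy. The sketch below is illustrative only: it simulates the throttled call and uses a plain &lt;code&gt;RuntimeError&lt;/code&gt; as a stand-in; in real code you would catch &lt;code&gt;botocore&lt;/code&gt;'s &lt;code&gt;ClientError&lt;/code&gt; and inspect its error code for &lt;code&gt;ThrottlingException&lt;/code&gt;:&lt;/p&gt;

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on throttling, doubling the delay each attempt plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:  # stand-in for a real throttling ClientError
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))

# Simulated call that is throttled twice, then succeeds on the third try.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] >= 3:
        return "ok"
    raise RuntimeError("ThrottlingException (simulated)")

result = with_backoff(flaky, base_delay=0)
print(result)
```

&lt;p&gt;Jitter spreads out retries from concurrent callers so they do not re-collide on the same quota window.&lt;/p&gt;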

&lt;p&gt;&lt;strong&gt;User Interface&lt;/strong&gt;: Creating custom web applications that leverage Bedrock Agents for AWS operations is a very compelling use case for many organizations. It enhances accessibility by enabling users who are not experts in the AWS console, CLI, or SDKs to perform pre-defined operational tasks. These can also assist technical teams and increase operational efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customization vs. Amazon Q&lt;/strong&gt;: Amazon Q in the console is a great starting point and provides general AWS assistance. AWS will keep integrating more functionality into Q in console and also Gen AI capabilities in other services. However, when you define action groups and knowledge bases for your agent that are unique to your organization, you can provide a tailored user experience and workflow integration. A custom web application built on Bedrock Agents allows you to create a highly specialized, integrated, and controlled operational tool tailored precisely to your organization's unique environment, processes, and user needs, going beyond the generic capabilities of a console-level assistant. It's about building your organization's intelligent assistant for its specific operational challenges.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article demonstrates how to build intelligent AWS automation using Amazon Bedrock Agents with custom web interfaces.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bedrock</category>
      <category>agenticai</category>
      <category>aiops</category>
    </item>
    <item>
      <title>The Hidden Realities of Cloud Migration: Lessons from the Trenches</title>
      <dc:creator>Sauveer Ketan</dc:creator>
      <pubDate>Sun, 13 Jul 2025 19:06:03 +0000</pubDate>
      <link>https://dev.to/sauveer_ketan/the-hidden-realities-of-cloud-migration-lessons-from-the-trenches-54cb</link>
      <guid>https://dev.to/sauveer_ketan/the-hidden-realities-of-cloud-migration-lessons-from-the-trenches-54cb</guid>
      <description>&lt;p&gt;In theory, cloud migration is straightforward. You assess, plan, and execute. In practice, however, it's far more complex. While playbooks provide a solid framework, the real world often throws situations that demand adaptability and human intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Textbook View: Assess, Mobilize, Migrate
&lt;/h2&gt;

&lt;p&gt;AWS's migration playbook outlines three distinct phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assess&lt;/strong&gt;: Understand your current environment and build a compelling business case for migration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobilize&lt;/strong&gt;: Prepare your AWS foundation, define your architecture, and finalize your migration plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migrate&lt;/strong&gt;: Execute the move of your workloads and data to AWS. In this phase, migration is divided into two stages: initialize and implement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58hjy16wwz0zglbgzxq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58hjy16wwz0zglbgzxq3.png" alt="Phases of a large Migration" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This structured approach sounds simple enough. Yet, once you start planning, you quickly realize that cloud migration success isn't just about technology; it's equally about people, processes, and preparedness.&lt;/p&gt;

&lt;p&gt;When service providers present their proposals to clients, this difference in exposure often leads to heated discussions. Clients with no prior migration experience tend to downplay the role of "known unknowns" and "unknown unknowns" and push for shorter timelines with fewer people, while service providers draw on their past migrations to budget for these unknowns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools
&lt;/h2&gt;

&lt;p&gt;AWS provides a robust suite of tools to streamline cloud migrations, encompassing discovery, planning, and execution phases. &lt;strong&gt;Migration Evaluator&lt;/strong&gt;, used in the initial Assess phase, is crucial for building a data-driven business case. &lt;strong&gt;AWS Migration Hub&lt;/strong&gt; acts as a central console, offering a unified view of your migration progress and integrating with various services. For detailed assessment and dependency mapping, &lt;strong&gt;AWS Application Discovery Service&lt;/strong&gt; helps gather crucial information about on-premises servers, applications, and their interdependencies. When it comes to the actual "lift-and-shift" of servers and virtual machines, &lt;strong&gt;AWS Application Migration Service (AWS MGN)&lt;/strong&gt; is the go-to solution, automating server replication and cutover with minimal downtime, making it efficient for migrating diverse workloads, including those you might find with legacy systems.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Cloud Migration Factory on AWS solution&lt;/strong&gt; is designed to coordinate and automate manual processes for large-scale migrations involving a substantial number of applications. This solution helps enterprises improve performance and prevents long cutover windows by providing an orchestration platform for migrating workloads to AWS at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Transform for VMware&lt;/strong&gt; is a newer service in the toolkit. It is an agentic AI service which automates application discovery and dependency mapping, network translation, wave planning, and server migration while optimizing EC2 instance selection to accelerate VMware workloads migration.&lt;/p&gt;

&lt;p&gt;While AWS provides a robust suite of native tools, the broader cloud migration ecosystem also includes a variety of &lt;strong&gt;AWS Partner solutions&lt;/strong&gt;. Cloudamize, modelizeIT, Flexera, RiverMeadow, etc., are a few of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Really Happens: Real-World Cloud Migration
&lt;/h2&gt;

&lt;p&gt;Despite rigorous preparation, detailed runbooks, and sophisticated migration tooling, we consistently ran into challenges. Here are a few concrete lessons learned from many actual cloud migrations, issues that only became clear through firsthand experience. The focus here is on rehost (lift-and-shift) migrations only.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Gaps in Dependency Mapping
&lt;/h3&gt;

&lt;p&gt;Even with advanced automated discovery tools and thorough application team walkthroughs, some critical interdependencies inevitably slipped through the cracks. These hidden connections became glaringly apparent only during the high-pressure cutover windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: Always supplement automated discovery tool outputs with detailed, in-depth application interviews and meticulous system-level dependency validation.&lt;/p&gt;
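&lt;p&gt;One low-tech way to do system-level validation is to aggregate observed network connections (for example, exported from &lt;code&gt;netstat&lt;/code&gt; or &lt;code&gt;ss&lt;/code&gt; on each host) and cross-check them against the discovery tool's dependency map. A minimal sketch with hypothetical hosts and ports:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical connection records gathered from each host:
# (source_host, destination_host, destination_port)
connections = [
    ("app-01", "db-01", 1521),
    ("app-01", "nfs-01", 2049),
    ("batch-01", "db-01", 1521),
    ("app-02", "nfs-01", 2049),
]

def dependency_map(conns):
    """Group observed connections into a per-target map of dependent hosts,
    useful for spotting links the discovery tool missed."""
    deps = defaultdict(set)
    for src, dst, port in conns:
        deps[(dst, port)].add(src)
    return deps

deps = dependency_map(connections)
# Hosts that would break if nfs-01:2049 moved without them in the same wave:
print(sorted(deps[("nfs-01", 2049)]))
```

&lt;p&gt;Any host that shows up here but not in the tool's map is a candidate for a follow-up application interview.&lt;/p&gt;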

&lt;h3&gt;
  
  
  2. Overprovisioning and Cost Optimization
&lt;/h3&gt;

&lt;p&gt;A common initial misstep was placing all production databases on expensive io2 volumes, based on an assumption of high IOPS needs. In reality, most systems didn't require such high performance.&lt;/p&gt;

&lt;p&gt;Mid-migration, we shifted our default storage strategy to gp3. Post-migration, we diligently monitored actual IOPS metrics and only upgraded volumes where necessary. We also planned and migrated the io2 volumes back to gp3, which is easy to do in AWS.&lt;/p&gt;

&lt;p&gt;Along similar lines, application teams sometimes want as much CPU and RAM as they had on-premises, fearing application performance degradation. Sizing should instead follow the rightsizing recommendations of the assessment tools, and the decision should not be left to the application teams alone (this requires senior stakeholders' buy-in and support). Resizing EC2 instances post-migration is quick and easy in AWS if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;: Baseline your IOPS needs using measured metrics rather than assumptions about storage requirements. Rightsize EC2 instances based on the assessment tools' recommendations.&lt;/p&gt;
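&lt;p&gt;The storage decision can be reduced to a simple rule of thumb: gp3 includes a 3,000 IOPS baseline and can be provisioned up to 16,000 IOPS, so io2 is only needed beyond that. A minimal helper sketch:&lt;/p&gt;

```python
def recommend_volume_type(peak_iops):
    """Pick the cheapest EBS volume type that covers the observed peak IOPS.
    gp3 ships with a 3,000 IOPS baseline and can be provisioned up to
    16,000; beyond that, io2 is the fallback."""
    if peak_iops > 16000:
        return "io2"
    if peak_iops > 3000:
        return "gp3 (provision extra IOPS)"
    return "gp3"

print(recommend_volume_type(1200))
print(recommend_volume_type(8000))
print(recommend_volume_type(40000))
```

&lt;p&gt;Post-migration, feed it peak IOPS measured from CloudWatch rather than estimates from application teams.&lt;/p&gt;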

&lt;h3&gt;
  
  
  3. COTS Applications Can Be Complicated
&lt;/h3&gt;

&lt;p&gt;Commercial Off-the-Shelf (COTS) applications often introduce unique hurdles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some had unsupported licensing models in AWS.&lt;/li&gt;
&lt;li&gt;Others enforced minimum CPU or RAM allocations and refused to run on the smaller EC2 sizes recommended by the assessment tools, despite low actual utilization.&lt;/li&gt;
&lt;li&gt;Certain applications, like Tableau, could not be lifted and shifted directly due to architectural or licensing constraints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: Thoroughly review vendor support statements and validate technical feasibility with Proof of Concepts (PoCs) early in the project lifecycle. Discuss the plan with the vendor and engage them for the migration window. Involve application teams during test cutovers; full-fledged testing might not be possible at this stage, but see whether some sanity tests can be performed.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Unexpected Machine Password Resets on Windows
&lt;/h3&gt;

&lt;p&gt;A subtle yet disruptive issue involved monthly Windows machine password resets. Systems failed to join Active Directory during migration if their machine account password changed during the cutover window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Implement pre-checks for password age and force resets before migration where necessary to ensure smooth domain joining.&lt;/p&gt;
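&lt;p&gt;A simple pre-check compares each machine account's password age (Active Directory rotates machine-account passwords every 30 days by default) against the cutover window. A sketch of the date logic, with hypothetical dates:&lt;/p&gt;

```python
from datetime import datetime, timedelta

ROTATION = timedelta(days=30)  # AD's default machine-account password age

def needs_reset(pwd_last_set, cutover_start, cutover_hours=8):
    """Flag machines whose machine-account password is due to rotate
    before the cutover window ends; reset those passwords up front."""
    rotation_due = pwd_last_set + ROTATION
    window_end = cutover_start + timedelta(hours=cutover_hours)
    return window_end >= rotation_due

cutover = datetime(2025, 7, 19, 20, 0)
safe = datetime(2025, 7, 15)   # set recently, rotates well after the window
risky = datetime(2025, 6, 20)  # rotation falls inside the window
print(needs_reset(safe, cutover), needs_reset(risky, cutover))
```

&lt;p&gt;In practice the &lt;code&gt;pwd_last_set&lt;/code&gt; values would come from an AD query (for example, the &lt;code&gt;pwdLastSet&lt;/code&gt; attribute) rather than hard-coded dates.&lt;/p&gt;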

&lt;h3&gt;
  
  
  5. Third-Party Tooling and Licensing
&lt;/h3&gt;

&lt;p&gt;Many organizations rely on various third-party tools (monitoring, security, etc.). Tools crucial for post-migration access and verification, such as BeyondTrust PowerBroker, often had a limited number of licenses, and reassignment was required. This bottleneck caused significant delays in validation efforts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: Proactively align tool licensing with peak migration activity requirements to prevent unexpected hold-ups.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. fstab Issues in Linux Systems
&lt;/h3&gt;

&lt;p&gt;Stale fstab entries, such as references to decommissioned NFS mounts or outdated disk UUIDs, can lead to boot failures on Linux systems. In some cases, manual intervention via rescue mode was required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;: Gracefully reboot servers a few days prior to cutover. This simple step can surface latent boot-time issues before they impact your migration window.&lt;/p&gt;
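&lt;p&gt;Alongside a graceful reboot, a quick fstab audit can catch risky entries before cutover. The sketch below is a minimal, hypothetical pre-check (the sample entries and paths are illustrative, not from any real host): it flags NFS mounts and UUID-based entries that lack the &lt;code&gt;nofail&lt;/code&gt; option, since a single stale entry of either kind can hang the boot after migration.&lt;/p&gt;

```shell
# Hypothetical pre-cutover fstab audit: print mount points worth
# double-checking, i.e. NFS mounts and UUID-based entries that would
# block boot (no "nofail") if the source disappears after migration.
check_fstab() {
  # reads fstab-formatted lines on stdin, prints flagged mount points
  awk '$1 !~ /^#/ && NF >= 4 {
    if ($3 == "nfs" || $1 ~ /^UUID=/) {
      if ($4 !~ /nofail/) print $2
    }
  }'
}

# Example run against sample entries (placeholder servers and UUIDs);
# on a real host you would feed it /etc/fstab instead.
printf '%s\n' \
  'UUID=abcd-1234  /data     ext4  defaults  0 2' \
  'nas01:/exports  /mnt/nfs  nfs   rw,hard   0 0' \
  '/dev/sda1       /         ext4  defaults  0 1' \
  | check_fstab
# prints: /data and /mnt/nfs

# On modern distros, "findmnt --verify" performs a fuller consistency
# check of /etc/fstab.
```

Entries it flags are not necessarily wrong; they are the ones to verify (does the UUID still exist? is the NFS server reachable from AWS?) before the migration window.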

&lt;h3&gt;
  
  
  7. On-Premise NFS Mounts Introduced Latency
&lt;/h3&gt;

&lt;p&gt;We discovered that some applications, after migration, were referencing static content over Network File System (NFS) mounts from on-premise NFS servers. This introduced significant latency, impacting performance.&lt;/p&gt;

&lt;p&gt;These mounts had to be migrated to AWS services like Amazon EFS or FSx. AWS Storage Gateway is also an option for hybrid scenarios where some on-premise data still needs to be accessed. In some instances, we built fallback mechanisms directly into the applications; in other cases, applications had to be rolled back and migrated later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation&lt;/strong&gt;: This takes us back to the first point above: dependency mapping and proper testing before migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Anti-Virus Interference with NFS
&lt;/h3&gt;

&lt;p&gt;In one migration, a perplexing performance issue arose when anti-virus software was found to be scanning NFS-mounted directories. While ping and traceroute showed no network issues, application performance dropped dramatically: every upload to the NFS server triggered an anti-virus scan. The servers running the anti-virus software were low on memory, and a graceful reboot fixed the issue. It took a few hours and multiple people to find the root cause, as no single person had access to all the components involved. This is an excellent example of "unknown unknowns."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation&lt;/strong&gt;: The support team added monthly anti-virus server reboots to their playbook to alleviate this subtle yet impactful problem. Similar measures can be useful for other centralized tooling servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Legacy Systems Need Special Handling
&lt;/h3&gt;

&lt;p&gt;Older systems, such as Windows 2008 servers, RHEL 5, etc., presented unique challenges. They could only be migrated to older Xen-based instance types and hence had fewer EC2 options. In one very special case, we migrated a Windows 2003 server with 1 GB of RAM. Support for these legacy operating systems was limited, and migrating them successfully sometimes took multiple attempts. Some configuration changes may also be required; for example, RHEL 6 servers need ENA drivers installed before they can run on Nitro instances, and missing this step leads to failures and troubleshooting.&lt;/p&gt;

&lt;p&gt;If an OS upgrade is an option, consider it during migration planning rather than relying on a legacy OS.&lt;/p&gt;
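&lt;p&gt;For the ENA case specifically, a simple pre-flight check on the source (or test-launched) server avoids a failed Nitro launch. This is a hedged sketch: the instance ID in the comment is a placeholder, and the exact remediation (kernel module package vs. kernel upgrade) depends on the distro.&lt;/p&gt;

```shell
# Hypothetical pre-flight check before retargeting an older Linux server
# (e.g. RHEL 6) to a Nitro instance type: Nitro requires the ENA network
# driver, which older images often lack.
if modinfo ena 1> /dev/null 2> /dev/null; then
  echo "ENA driver present"
else
  echo "ENA driver missing: install the module or upgrade the kernel first"
fi

# Once the driver is in place, the AWS-side flag is set on the stopped
# instance (instance ID is a placeholder):
#   aws ec2 modify-instance-attribute --ena-support \
#       --instance-id i-0123456789abcdef0
```

Running the check a few days before cutover, alongside the graceful reboot recommended earlier, keeps driver surprises out of the migration window.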

&lt;h3&gt;
  
  
  10. Decommission
&lt;/h3&gt;

&lt;p&gt;In one migration, rigorous assessment revealed that out of around 500 servers, around 80 could be decommissioned, leading to huge savings. Here is an interesting idea: even if you are not migrating to the cloud in the near future, why not run a full-fledged assessment periodically to uncover such wasted resources?&lt;/p&gt;

&lt;p&gt;Retire is one of the 7 Rs of migration (Retire, Retain, Rehost, Relocate, Replatform, Repurchase, and Refactor) and a very important one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operational &amp;amp; Technical Observations
&lt;/h2&gt;

&lt;p&gt;Beyond specific application-level issues, several broader operational and technical challenges emerged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS Limits&lt;/strong&gt;: In some migrations we moved hundreds of servers per week, and we frequently hit AWS service limits, including those for snapshots, API calls, and AWS MGN (or CloudEndure before it). This necessitated requesting additional accounts and quota increases. Plan for these in advance and have them raised beforehand.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disk Attachment Limits&lt;/strong&gt;: Some source systems exceeded EC2 disk attachment limits, requiring architectural restructuring. This is an edge case, but the affected systems are often critical ones; treat it as a key consideration during source assessment and target architecture design in the Mobilize phase.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Oracle ASM FD&lt;/strong&gt;: We had multiple servers running Oracle ASMFD. During one migration, troubleshooting with AWS revealed that while CloudEndure supported Oracle ASM (Automatic Storage Management), it did not support ASMFD (ASM Filter Driver). AWS provided excellent support, delivered a fix after a few weeks, and we were able to migrate these servers successfully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;F5 iRules and ALB&lt;/strong&gt;: Load balancer behaviors differed between on-premises F5 iRules and AWS ALB, and during planning it became clear this would require refactoring some applications. One example is client IP handling: a few applications needed the client IP directly, which worked fine with the on-prem F5 using direct pass-through, but an ALB delivers the client IP in the X-Forwarded-For header instead. These kinds of scenarios are an opportunity to adopt cloud-native features and modernize the applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hardcoded IPs&lt;/strong&gt;: A common problem, especially in development environments, was hardcoded IP addresses in TLS certificates and application configurations, which complicated the migration. These were often noticed only after the migration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ENI Pre-Provisioning&lt;/strong&gt;: In specific scenarios, Elastic Network Interfaces (ENIs) had to be pre-created and preserved so that static IPs were known beforehand and could be configured in firewalls and load balancers to avoid downtime. Missing this step can force an application backout.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
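&lt;p&gt;The ENI pre-provisioning step above can be sketched with the AWS CLI. This is a minimal, hedged example: the subnet, security group, IP, and name values are placeholders, and how the resulting ENI is consumed (for example, referenced from the launch template used at cutover) depends on your tooling.&lt;/p&gt;

```shell
# Hedged sketch: pre-create an ENI with a known private IP so firewall
# and load balancer rules can be configured before cutover.
precreate_eni() {
  subnet="$1"; sg="$2"; ip="$3"; name="$4"
  aws ec2 create-network-interface \
    --subnet-id "$subnet" \
    --groups "$sg" \
    --private-ip-address "$ip" \
    --description "pre-provisioned ENI for $name"
}

# Usage (all IDs are placeholders):
#   precreate_eni subnet-0123456789abcdef0 sg-0123456789abcdef0 10.0.1.25 app01
# The eni-xxxx ID returned can then be referenced in the launch template
# so the migrated instance comes up with the expected IP.
```

Because the IP is fixed before the server ever launches, firewall and load balancer changes can be raised and approved ahead of the migration window instead of during it.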

&lt;h2&gt;
  
  
  Process and Collaboration Insights
&lt;/h2&gt;

&lt;p&gt;Effective process and strong collaboration were vital to navigating these complexities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Weekly Lessons Learned (LL) calls&lt;/strong&gt;: We instituted weekly LL calls every Tuesday post-migration to review and capture insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Documentation&lt;/strong&gt;: All issues, along with their resolutions, were meticulously documented and stored in a central SharePoint portal for easy access and reuse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mandatory Review&lt;/strong&gt;: All migration engineers were required to read, contribute to, and reuse this living documentation, and were randomly asked to give walkthroughs of it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS TAM Coordination&lt;/strong&gt;: Close coordination with AWS Technical Account Managers (TAMs) proved invaluable in resolving roadblocks and accelerating issue resolution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runbook Updates&lt;/strong&gt;: Runbooks were continuously updated after each migration wave, incorporating real-world field feedback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Roster Updates&lt;/strong&gt;: Multiple teams need to be available during the migration window: various infra support teams, application teams, vendor support for COTS applications, etc. Based on lessons learned, we rigorously updated our rosters and confirmed each team's engagement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Recommendations for Future Migrations
&lt;/h2&gt;

&lt;p&gt;Based on our experiences, here are key recommendations for any organization embarking on a cloud migration journey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expect Surprises&lt;/strong&gt;: Always anticipate the unexpected, especially when dealing with legacy configurations and potential human errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create Buffer Bandwidth&lt;/strong&gt;: Build buffer capacity into both your engineering team and your project schedule to absorb unforeseen challenges. A single issue can tie up one of your engineers for hours, when the plan had her handling 10 servers during the migration!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make Graceful Pre-Migration Reboots Standard&lt;/strong&gt;: Implement pre-migration reboots as a standard procedure to surface latent boot-time issues before they impact your cutover window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate Tools and Licensing&lt;/strong&gt;: Thoroughly validate all tools and their licensing requirements for both pre- and post-migration activities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Every Issue&lt;/strong&gt;: Treat every encountered issue and its fix as a critical piece of your living migration playbook. Document everything diligently, hold lessons-learned discussions after every wave, and update your runbooks and rosters accordingly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Cloud migration is more than just a technological shift; it's a comprehensive change management initiative that impacts systems, teams, and deeply ingrained assumptions. No matter how meticulously you plan, real-world migrations will invariably expose issues that only practical experience can help you resolve.&lt;/p&gt;

&lt;p&gt;The key to a successful migration lies in your ability to focus on learning, diligent documentation, and constant adaptation. That's what truly transforms a good migration strategy into a great one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are some unexpected challenges you've faced during cloud migrations, and how did you overcome them?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>migration</category>
      <category>rehost</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
