DEV Community: Rohini Gaonkar

I Switched to the Agent Toolkit for AWS. Here's Why.

Rohini Gaonkar — Fri, 12 Jun 2026 16:45:05 +0000

I've been using AI coding agents like Kiro and Claude Code with AWS for a while now. To connect them to my AWS account, I was running the community MCP servers from awslabs: the AWS one, the documentation one, sometimes both.

It worked. But it felt like handing my house keys to a very enthusiastic intern and hoping they didn't rearrange the furniture while I was out. The agent had my credentials but no restrictions on what it could do, and zero audit trail of what it actually did.

Then I switched to the Agent Toolkit for AWS. It's the difference between that enthusiastic intern and a contractor who shows up with their own tools, follows the scope you agreed on, and leaves you a detailed invoice of every change they made.

What is it?

The Agent Toolkit for AWS is the official, AWS-managed suite of tools that helps AI coding agents build, deploy, and manage things on AWS. Four components:

AWS MCP Server: a managed remote server that gives agents secure access to AWS APIs via the Model Context Protocol
Skills: curated step-by-step workflows for specific tasks (deploying serverless apps, debugging Lambda cold starts, etc.)
Plugins: single-install packages that bundle MCP config + skills for your IDE
Rules files: project-level configuration to guide agent behavior

Why I switched

Here's the thing. The old community servers were fine for experimenting. But the moment I started trusting agents with real infrastructure, I needed more control.

Security that actually means something.

The managed AWS MCP Server supports IAM condition keys. I can restrict exactly which actions an agent can perform. Scope the IAM role down to the minimum permissions the agent needs for the task, and it can only operate within those bounds.

The MCP Server automatically tags every request with condition keys (aws:CalledViaAWSMCP). So you can write IAM policies that treat agent actions differently from your own. For example, this would prevent the agent from deleting buckets, even if your credentials normally allow it:

{
  "Effect": "Deny",
  "Action": "s3:DeleteBucket",
  "Resource": "*",
  "Condition": {
    "Bool": {
      "aws:CalledViaAWSMCP": "true"
    }
  }
}

You still have full access. The agent doesn't.
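The statement above is a fragment. IAM expects it inside a policy document with a Version and a Statement list. Here is a minimal sketch in Python that wraps it and prints the full JSON. The helper name build_agent_deny_policy is made up for illustration, and the condition key is copied from this post, so check it against the AWS documentation before relying on it.

```python
import json


def build_agent_deny_policy(actions: list[str]) -> dict:
    """Wrap a Deny statement for MCP-initiated calls in a full policy document."""
    return {
        "Version": "2012-10-17",  # current IAM policy language version
        "Statement": [
            {
                "Effect": "Deny",
                "Action": actions,
                "Resource": "*",
                # Condition key name as given in this post (an assumption to verify)
                "Condition": {"Bool": {"aws:CalledViaAWSMCP": "true"}},
            }
        ],
    }


policy = build_agent_deny_policy(["s3:DeleteBucket"])
print(json.dumps(policy, indent=2))
```

Passing a list of actions makes it easy to add other destructive calls to the same Deny statement later.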

Even better: use a separate IAM profile for your agent with only the permissions it needs. The condition keys are a safety net, but a scoped-down profile is the first line of defense. And if you're just getting started, point it at a dev account, not production.

Sandboxed code execution.

The toolkit includes a sandboxed Python runtime with boto3 access. Agents can write and run multi-step scripts (list resources, filter, aggregate) without touching my local machine.

The agent wrote a boto3 call, ran it remotely, and got structured results back. My machine never ran that code.

I can see what it did.

Every API call goes through CloudTrail. Metrics flow to CloudWatch. I get a full audit trail. With the old server, I'd have to dig through terminal history and hope I caught everything.

Every MCP-initiated call shows invokedBy: aws-mcp.amazonaws.com in the event fields. When you call aws s3 ls from your terminal, the sourceIPAddress would be your IP. When the MCP Server makes the call, it's aws-mcp.amazonaws.com. That's how you tell them apart.
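If you export CloudTrail records as JSON, separating agent calls from your own can be sketched like this. It assumes each record is already parsed into a dict and that invokedBy sits under userIdentity (the post only says "in the event fields"). The function name and the sample records are hypothetical and heavily trimmed.

```python
def is_mcp_initiated(event: dict, marker: str = "aws-mcp.amazonaws.com") -> bool:
    """Return True if a parsed CloudTrail record looks like it came via the MCP Server."""
    invoked_by = event.get("userIdentity", {}).get("invokedBy")
    source_ip = event.get("sourceIPAddress")
    return marker in (invoked_by, source_ip)


# Hypothetical records for illustration only; real events carry many more fields
events = [
    {"eventName": "ListBuckets", "sourceIPAddress": "203.0.113.10", "userIdentity": {}},
    {
        "eventName": "ListBuckets",
        "sourceIPAddress": "aws-mcp.amazonaws.com",
        "userIdentity": {"invokedBy": "aws-mcp.amazonaws.com"},
    },
]

agent_events = [e for e in events if is_mcp_initiated(e)]
print(len(agent_events))  # 1
```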

Built-in docs search.

No more running a separate documentation MCP server. The Agent Toolkit has native tools to search AWS docs, read full pages, get content recommendations, and check regional availability. All in one server.

Expert skills.

These are curated workflows that go beyond documentation. Decision frameworks, troubleshooting trees, step-by-step procedures. For example, the aws-serverless skill covers Lambda, API Gateway, Step Functions, EventBridge, SAM, and CDK with guidance on cold starts, CORS debugging, concurrency, and production readiness.

We will explore these in future posts!

Multi-profile support.

If you work across multiple AWS accounts, there's built-in profile switching. Pass --profile in the config and the agent routes requests through the right credentials. The MCP Configuration section below shows how to set this up.

Side by side

Let the table do the talking:

	Old (`awslabs.aws-api-mcp-server`)	New (Agent Toolkit `aws-mcp`)
Type	Community/labs, runs locally	Official AWS-managed remote server
Auth	Local credentials, no restrictions	SigV4 + IAM condition keys
Security	No guardrails	Fine-grained IAM controls
Observability	None	CloudWatch + CloudTrail
Code execution	Not available	Sandboxed Python with boto3
Skills	Not included	Curated expert workflows
Documentation	Needed separate server	Built-in search + read
Maintenance	Manual `uvx` updates	AWS-managed, always current
Multi-profile	Not supported	Built-in

Getting started

If you're still running the old community MCP servers, the switch took me about five minutes. Try it and let me know what you think.

Prerequisites

AWS CLI v2.32.0+ installed
uv installed (for the proxy)
Valid AWS credentials

The Agent Toolkit itself is free. You only pay for the AWS resources your agent provisions or interacts with, at standard pricing. There are default quotas to be aware of, the main one being 3 requests per second per account. Fine for individual use, but worth knowing if you have multiple agents running in the same account.

Disable conflicting servers

If you have any of the old awslabs MCP servers configured (like aws-mcp-server or aws-documentation-mcp-server), disable them to avoid tool conflicts. You can always re-enable them later if you need to compare.

MCP Configuration

Using Claude Code, Cursor, or something else? Check the GitHub repo for setup instructions across platforms.

For Kiro, add this to ~/.kiro/settings/mcp.json:

{
  "mcpServers": {
    "aws-mcp": {
      "command": "uvx",
      "timeout": 100000,
      "transport": "stdio",
      "args": [
        "mcp-proxy-for-aws==1.6.0",
        "https://aws-mcp.us-east-1.api.aws/mcp",
        "--metadata", "AWS_REGION=us-west-2"
      ]
    }
  }
}

There are two regions in this config. The region in the endpoint URL (us-east-1 or eu-central-1) is where the MCP Server itself runs. AWS_REGION is where your AWS resources live, so set it to the region your workloads run in.

If you use a named profile:

"args": [
  "mcp-proxy-for-aws==1.6.0",
  "https://aws-mcp.us-east-1.api.aws/mcp",
  "--metadata", "AWS_REGION=us-west-2",
  "--profile", "your-profile-name"
]
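If you switch between regions or profiles often, generating the config may be less error-prone than hand-editing it. A small sketch using only the values shown above. The function name build_mcp_config is hypothetical, and writing the result to ~/.kiro/settings/mcp.json is left to you.

```python
import json
from typing import Optional


def build_mcp_config(
    endpoint_region: str, workload_region: str, profile: Optional[str] = None
) -> dict:
    """Build the aws-mcp entry; the two regions are deliberately separate arguments."""
    args = [
        "mcp-proxy-for-aws==1.6.0",
        f"https://aws-mcp.{endpoint_region}.api.aws/mcp",  # where the MCP Server runs
        "--metadata", f"AWS_REGION={workload_region}",     # where your resources live
    ]
    if profile:
        args += ["--profile", profile]
    return {
        "mcpServers": {
            "aws-mcp": {
                "command": "uvx",
                "timeout": 100000,
                "transport": "stdio",
                "args": args,
            }
        }
    }


config = build_mcp_config("us-east-1", "us-west-2", profile="your-profile-name")
print(json.dumps(config, indent=2))
```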

Verify

Ask your agent: "List my S3 buckets". If it works, you're set.

🫣 Yes, I need to clean up my buckets, again!


Have you made the switch yet? Tell me your experience.


How to make AI answer questions about your documents

Rohini Gaonkar — Thu, 11 Jun 2026 18:41:25 +0000

In the previous post, we talked about context windows. The model has a fixed-size desk and everything has to fit on it at once. When too much is on the desk, things in the middle get missed.

I ended that post with a promise: what if there was a way to give the model just the right piece, at the right time, from a document you've never even pasted in?

That's this post. We're giving the model a search system.

The problem: your document is too long

You have a 2000-page document. An employee handbook, a product manual, internal documentation. You need one specific answer from it.

You can't paste the whole thing into the model's context window. And even if you found a model with a window big enough, we learned what happens: attention degrades, things in the middle get missed, and the model answers confidently from the wrong section.

So you need something different. A step that happens before the model sees anything. Something that finds the 2-3 paragraphs that actually answer your question, and passes only those to the model.

That's retrieval. The full technique is called RAG: Retrieval-Augmented Generation. Search first, then generate.

Retrieval-Augmented Generation

Let's break the name down. Each word is a step.

Retrieval.
Go find relevant information. Think of it like checking the index of a textbook before diving into a chapter. You don't re-read the whole book. You find the right page first.

Augmented.
Add that retrieved info to the prompt. You're supplementing the model's built-in knowledge with fresh, specific context. Like handing someone a cheat sheet right before they answer a question.

Generation.
The model writes its response, but with the retrieved context sitting right there in the conversation. It generates an answer grounded in your actual data, not just its training. "Grounded" means the model has real evidence to point to. It's not guessing from memory. It's answering from something you gave it.

The whole loop in one sentence: find the right chunks of information, stuff them into the prompt, let the model answer using that context. That's it. That's RAG.

And if you're thinking "wait, isn't this just enterprise search?" you're not wrong.

Tools like Elasticsearch, Kendra, and SharePoint search have been finding relevant passages in documents for years. The retrieval part isn't new. What's new is the last step: instead of showing you a results page to read for yourself, a foundation model reads the evidence and writes the answer.

To put it simply, RAG is enterprise search with a language model at the end of the pipeline.

The setup: onboarding docs for a fictional company

Imagine you just joined a new company and on the first day they hand you a bunch of documents. Employee handbook, benefits guide, leave policy, expense rules, engineering onboarding, IT security. Six documents with thousands of lines. All the answers are in there somewhere, but you'd have to read all of them to find what you need.

I've got a fictional company here, PineRidge Solutions. These are their onboarding docs.

The goal: I type a question like "how many vacation days do I get?" or "what's the parental leave top-up?" and the system finds the right section and answers from it.

I'm building this in Kiro IDE, and for the models, I'm using Amazon Bedrock, the same service we've been using for the last four posts. Except now, instead of the Playground in the AWS Console, I'm calling it from my code.

Please note, I'm using Bedrock here, but the same pattern works with any embeddings model, running locally or in the cloud. Ollama locally, OpenAI, Cohere, whatever. The pipeline is the same. The model is just a plug.

All the code mentioned in this post is available in my GitHub repo here.

Three steps to build. Chunk, embed, retrieve. Let's go.

Step 1: Chunk the document

Before anyone can search these documents, they need to be broken into smaller pieces. Chunks. Usually a few paragraphs each.

Why? Because the goal is to return just the relevant section, not everything. If I keep each document as one giant block, the search will return entire files when I only need a paragraph.

How you split matters. Too large, and you're back to the "too much context" problem. Too small, and you might cut an answer in half.

Let's take a simple example.

Say the leave policy has three sentences: "The standard vacation policy grants 15 days per year. However, employees in their first year receive only 10 days. These days do not carry over into the next calendar year."

If I chunk without overlap, I might split after the second sentence. The next chunk starts with "These days do not carry over into the next calendar year."

Now if someone asks "do my vacation days carry over?" the system retrieves that chunk. It answers "these days do not carry over." But which days? The standard 15? The first-year 10? The word "these" has lost its referent. The chunk is meaningless on its own.

With overlap, the last sentence of chunk one repeats at the start of chunk two. Both chunks make sense independently.
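To make the difference concrete, here is a small sketch that chunks those three sentences with and without overlap. It works at sentence level purely for illustration, and chunk_sentences is a hypothetical helper, not the function used in the rest of this post.

```python
def chunk_sentences(sentences: list[str], size: int = 2, overlap: int = 0) -> list[str]:
    """Group sentences into chunks of `size`, repeating `overlap` sentences between chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start : start + size]))
        if start + size >= len(sentences):
            break  # the last sentence is already covered
    return chunks


sentences = [
    "The standard vacation policy grants 15 days per year.",
    "However, employees in their first year receive only 10 days.",
    "These days do not carry over into the next calendar year.",
]

no_overlap = chunk_sentences(sentences, size=2, overlap=0)
with_overlap = chunk_sentences(sentences, size=2, overlap=1)

print(no_overlap[1])    # only the "These days..." sentence, with no referent
print(with_overlap[1])  # the first-year sentence followed by "These days..."
```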

Here's the code:

import os


def chunk_docs_paragraph(folder: str) -> list[dict]:
    """Paragraph-based chunking with 1 paragraph of overlap."""
    chunks = []

    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(".md"):
            continue

        with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
            text = f.read()

        # Split document into paragraphs (separated by blank lines)
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

        for i in range(len(paragraphs)):
            # Include 1 paragraph of overlap for context continuity
            start = max(0, i - 1)
            chunk_text = "\n\n".join(paragraphs[start : i + 1])

            # Store the chunk text and which file it came from (for citations)
            chunks.append({"text": chunk_text, "source": filename})

    return chunks

The function loops through every markdown file in the folder, reads it, and splits on blank lines to get paragraphs. Then for each paragraph, it includes one paragraph of overlap (the one before it) so nothing gets lost at the boundary. Each chunk gets stored with the text and which file it came from, so later I know where the answer originated.

From six onboarding documents, I get about 150 chunks. Each one is roughly a paragraph or two. A self-contained piece of text.

Step one done. Now I need to make these searchable.

Step 2: Turn chunks into embeddings

Here's the concept that makes the whole thing work. Each chunk gets turned into a set of numbers called an embedding.

The name is a literal mathematical term. You're taking text and placing it into a space made of numbers. In that space, distance has meaning. Two chunks about similar things end up close together. Two chunks about different topics end up far apart.

"Parental leave top-up" and "salary during maternity leave" would be near each other numerically, even though the actual words are completely different. That's what makes this useful: an embedding captures meaning, not exact words.

Think of it like a library's index card system. The card doesn't contain the whole book. It captures enough about the content to help you find the right book when someone asks.

A specialised model called an embeddings model does this conversion for us. It's not the same model that generates your answer. It's a different model for a different job. The embeddings model is small and fast. It turns text into searchable numbers.

import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(text: str) -> list[float]:
    """Call Titan Embeddings V2 to get a 1024-dim vector."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    result = json.loads(response["body"].read())
    return result["embedding"]

Each chunk now has a numerical fingerprint. That's my searchable index.

Now you'll hear the term "vector" a lot. It just means a list of numbers with a direction. Think of it as coordinates.

An embedding is the concept, a vector is the format it's stored in.

Right now these vectors are sitting in a Python list on my laptop. If I close this script, they're gone. For this demo, I'm caching them to a local file so I don't re-embed every time I run the script. But for a production system with thousands of documents, you'd store them somewhere proper. AWS recently launched Amazon S3 Vectors, which is literally what it sounds like: S3 built for storing and searching vectors natively. There's also OpenSearch Serverless, pgvector if you want Postgres, or Amazon Bedrock Knowledge Bases which handles the whole pipeline as a managed service.
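The local cache mentioned above could look something like this. It's a sketch under assumptions: the cache is a single JSON file, it is only trusted when it was built from exactly the same chunk texts, and load_or_build_embeddings and the file layout are my own illustration rather than the code in the repo.

```python
import json
import os
from typing import Callable


def load_or_build_embeddings(
    chunks: list[dict],
    embed_fn: Callable[[str], list[float]],
    cache_path: str,
) -> list[list[float]]:
    """Return one vector per chunk, reusing a JSON cache file when it matches."""
    texts = [c["text"] for c in chunks]

    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cached = json.load(f)
        # Only trust the cache if it was built from exactly these chunk texts
        if cached.get("texts") == texts:
            return cached["vectors"]

    # Cache miss: embed every chunk once and save the result
    vectors = [embed_fn(t) for t in texts]
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"texts": texts, "vectors": vectors}, f)
    return vectors
```

With the functions from this post, the call would be load_or_build_embeddings(chunks, embed_text, "embeddings_cache.json"), where the file name is arbitrary. Editing any document changes the chunk texts, so the cache rebuilds itself on the next run.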

Step two done. Now, the search.

Step 3: Retrieve and Generate

Someone asks a question. The question gets embedded with the same model. Same kind of numbers. Then we compare the question's numbers against all the chunk numbers. The closest matches are my search results.

This is semantic search. It matches by meaning, not by exact words.

If the handbook says "remote work policy" and I ask about "working from home rules," it catches the match because the meaning is close.

import numpy as np

def retrieve(question: str, chunks: list[dict], embeddings: np.ndarray, top_k: int = 3):
    """Find the top-K most relevant chunks via cosine similarity."""

    # Embed the question into the same vector space as our chunks
    q_vec = np.array(embed_text(question))

    # Compare question vector against every chunk vector
    scores = []
    for i in range(len(chunks)):

        # Cosine similarity = dot product / (magnitude_a * magnitude_b)
        score = np.dot(q_vec, embeddings[i]) / (
            np.linalg.norm(q_vec) * np.linalg.norm(embeddings[i])
        )
        scores.append(score)

    # Sort by score descending, take top K
    top_indices = np.argsort(scores)[::-1][:top_k]

    return [chunks[idx] for idx in top_indices]

The retrieve function. It takes the question and embeds it with the same Titan model, so it's in the same number space as the chunks. Then it compares the question's numbers against every chunk's numbers using cosine similarity, which is just a way to measure how closely two vectors point in the same direction. A score of 1 means they point in exactly the same direction, and a score near 0 means they're unrelated. It sorts by score and returns the top 3.
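To see those two ends of the scale, here is the same formula applied to made-up three-dimensional vectors, written without numpy so it runs anywhere. The vectors are invented for illustration. Real embeddings from Titan have 1024 dimensions, as noted in the docstring above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings"
question = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # same direction, twice the length
unrelated = [0.0, 0.0, 5.0]       # shares no component with the question

print(cosine_similarity(question, same_direction))  # approximately 1.0
print(cosine_similarity(question, unrelated))       # 0.0
```

Note that the length of the vector doesn't matter, only the direction. That's why a short question can still match a long chunk.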

The top 3 chunks are my evidence. Now I pass them to a generation model alongside the question. Titan did the embeddings. Claude does the answering.

def generate_answer(question: str, retrieved: list[dict]) -> str:
    """Pass retrieved chunks + question to Claude."""

    # Format retrieved chunks with their source for traceability
    context = "\n\n---\n\n".join(
        f"[Source: {r['source']}]\n{r['text']}" for r in retrieved
    )

    # System-style instruction followed by context and question
    prompt = (
        f"You are answering questions about PineRidge Solutions' company policies. "
        f"Use ONLY the context below. If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # Call Claude via Bedrock's Converse API
    response = bedrock.converse(
        modelId="us.anthropic.claude-haiku-4-5-20251001-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

The function generate_answer. It takes the retrieved chunks, labels each one with which file it came from, and builds a prompt. The prompt tells Claude: "You're answering questions about PineRidge company policies. Use ONLY the context below. If the answer isn't there, say so." Then it passes the context and the question to Claude via Bedrock's Converse API and returns the response.

I asked: "What's the RRSP matching policy?"

The system retrieved the right section from the benefits guide. The answer came back grounded in the actual policy document: dollar-for-dollar match up to 5% of base salary, starts after 90 days, vesting schedule. Not from the model's training data, from the company's files. And I can see exactly which chunks were used to build that answer. That's my citation. I can point to the source.

The full pipeline. Chunk, embed, retrieve, generate. Running on my laptop. About 60 lines of Python. And it works.

Where it breaks: a quick preview

So this works great when retrieval finds the right piece. But watch this.

I asked: "How many vacation days do I get as a senior engineer?" Retrieval actually works. It finds the vacation table from the benefits guide. But the model says "I don't know which level a senior engineer is." The right information was retrieved, but the answer needed two pieces of context that aren't in the same chunk: what level maps to "senior engineer," and how many days that level gets.

That's the kind of thing that breaks. Retrieval succeeded, but the answer still failed. The model wasn't hallucinating. It was honest about what it couldn't determine from the evidence it had.

This is not a hallucination in the way we talked about in the hallucinations post. The model didn't invent something from nothing. It was given real text from the real document. But the retrieved chunks didn't contain everything needed to answer the question.

When a RAG system gives you a bad answer, the question to ask is: "what chunk did it retrieve?" Not "why is the model wrong?"
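One low-effort way to make that question a habit is to print what was retrieved before reading the model's answer. A sketch, assuming the chunk dicts have the "text" and "source" keys used earlier. The function name and the sample chunk are hypothetical.

```python
def format_retrieval_report(question: str, retrieved: list[dict], preview_chars: int = 80) -> str:
    """Build a plain-text report of which chunks were retrieved for a question."""
    lines = [f"Question: {question}"]
    for rank, chunk in enumerate(retrieved, start=1):
        # Flatten newlines so each chunk previews on a single line
        preview = chunk["text"].replace("\n", " ")[:preview_chars]
        lines.append(f"{rank}. [{chunk['source']}] {preview}")
    return "\n".join(lines)


# Hypothetical chunk, shaped like the ones produced by the chunking step
sample = [{"text": "Vacation days by level:\nL1 to L3 ...", "source": "benefits-guide.md"}]
report = format_retrieval_report("How many vacation days do I get as a senior engineer?", sample)
print(report)
```

If the piece you expected isn't in the report, the fix belongs in chunking or retrieval, not in the prompt.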

We'll diagnose and fix this properly in the next post.

Key takeaways

If you're just getting started: RAG is how you get AI to answer questions about your documents without pasting everything into the chat. It searches first, then answers from what it finds. Three steps: chunk, embed, retrieve. The model never sees the full document. Just the pieces that match your question.

If you're more on the builder side: RAG is a pipeline with independently tunable steps. Chunking strategy, embedding model, retrieval method, and generation model each affect quality on their own. Also worth noting: different models for different jobs in the same pipeline. Titan Embeddings for search (fast, cheap). Claude for generation (smart, conversational). You'll see this pattern everywhere in AI systems.

What's next

So this works great when retrieval finds the right piece. But what happens when the chunks are too small and the answer gets cut in half? What if the question needs information scattered across multiple sections? What if retrieval succeeds but the answer still fails because context is split across chunks?

Next post, we break this thing on purpose. Then we fix it. And I'll walk through the full toolkit of strategies that make retrieval actually reliable.

Ride along.

This post is part of the "Learning AI Out Loud" series, a cloud architect learning AI from first principles.

Follow along with the series

Why does AI forget what you said (and how to fix it)

Rohini Gaonkar — Mon, 25 May 2026 15:08:33 +0000

I received the following comments on my hallucinations blog post.

Comment on Why does AI lie? Hallucinations explained simply

Joske Vermeulen May 9

Just yesterday I had Opus asking me after every prompt: we have been going for a long time, let me save my context and continue tomorrow 😂

Comment on Why does AI lie? Hallucinations explained simply

Joske Vermeulen May 11

:D I really answered every time, you are a computer, just continue. But it became even worse, so I needed to start a new session :)

The model basically raised its hand and said "hey, we've been at this a while." That's actually the best-case scenario.

A lot of models won't do that. They'll just silently get worse. Same confident tone, less reliable answers. You won't know it's happening until something is clearly wrong.

You paste a long document in, ask about something in the middle, and you get a confident answer that's wrong. Or you have a twenty-message conversation and the model starts contradicting itself.

Not because it's hallucinating. Because it's running out of room.

In the previous post, we talked about model sizes. Tokens were the unit of cost. Today they become the unit of memory.

What a context window actually is

Every model has a context window. That's the total number of tokens it can hold in its head at once. Your input, plus its output, all has to fit inside that window.

Think of it like a desk. A fixed-size desk. Everything the model needs to think about has to be on that desk at the same time. Your question. The document you pasted. The conversation history. The system instructions. All of it.

If you put too much on the desk, things start getting buried. The model doesn't tell you "hey, I can't fit all this." It just works with whatever it can focus on, and quietly loses track of the rest.

How big is the desk? Depends on the model.

Some older models had a context window of 4,000 tokens. That's roughly 3,000 words. About six pages.

Some have 128,000 tokens. That's a short novel.

Some newer models have a million tokens or more. That's multiple novels. Entire codebases.

But here's the thing most people miss. A bigger context window doesn't always mean the model pays equal attention to everything in it. It means more fits on the desk. It doesn't mean the model reads every page with the same care.

Two shapes of the same problem

Let's see this limit in two ways.

Documents

You paste twenty pages of text into a model. A legal contract, an insurance policy, internal documentation. You ask a question about something in section 7 of 15. The model might find it, it might miss it, or it might pull from the wrong section entirely.

The more text surrounding your target information, the more the model's attention gets diluted. Even if the window isn't full.

Conversations

This is where most people hit it first, like the commenter above.

By default, the model doesn't have a separate "memory" for your conversation. Some products layer persistence on top (ChatGPT's memory, Claude's projects), but the model underneath still works the same way. Every single time you send a message, the model re-reads the entire conversation from the beginning. Your first message, its first reply, your second message, its second reply, all the way down to whatever you just typed.

That whole transcript gets fed back in every single time. And each exchange adds more tokens to the pile.

A typical question might be 50 tokens. The model's reply might be 300. So one exchange is 350 tokens.

Ten exchanges? 3,500 tokens.
Twenty exchanges? 7,000.

If you're asking detailed questions and getting long answers, you can hit 20,000 or 30,000 tokens in an afternoon.

And here's the catch: you're not just using up memory. You're re-sending and re-paying for the entire conversation history every single turn.

Tokens are the unit of memory and the unit of cost. Same resource, two consequences.
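The arithmetic above can be sketched in a few lines. This assumes the simplest case: every call re-sends the full transcript, with no caching, no truncation, and no system prompt. The 50 and 300 token figures are the illustrative numbers from this section, and the function name is made up.

```python
def conversation_cost(
    exchanges: int, question_tokens: int = 50, reply_tokens: int = 300
) -> tuple[int, int]:
    """Return (transcript_size, total_input_tokens_sent) after `exchanges` turns."""
    transcript = 0   # tokens sitting in the conversation so far (memory)
    total_input = 0  # tokens sent to the model across all calls (cost)
    for _ in range(exchanges):
        # Each call re-sends the whole prior transcript plus the new question
        total_input += transcript + question_tokens
        transcript += question_tokens + reply_tokens
    return transcript, total_input


print(conversation_cost(10))  # (3500, 16250)
print(conversation_cost(20))  # (7000, 67500)
```

The transcript grows in a straight line, but the total tokens you've sent grows much faster: doubling the conversation from 10 to 20 exchanges roughly quadruples the input you've paid for.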

Models have gotten much better at handling long inputs. You can throw surprisingly large documents at them now. But the limit still exists. And the longer the input, the more likely something gets missed.

Lost in the middle

Researchers have a name for this. They call it "lost in the middle."

When you give a model a long input, whether that's a document or a conversation history, it tends to pay the most attention to two places: the very beginning, and the very end. The stuff in the middle gets less focus.

It's like reading a long email thread. You remember how it started. You remember the latest message. But that reply from Tuesday at 2pm that's buried fourteen messages deep? Good luck.

This is why things you said early in a conversation drift as the transcript grows. Your early messages end up in the middle of the window and the middle is where attention is weakest.

Most models won't warn you. They'll just give you the same confident tone whether they are working from a clear, focused input or they are drowning in context. The commenter's experience with Opus was the rare exception, not the rule.

What you can do about it

Bigger window

Use a model with a bigger window if you're hitting limits. A bigger window is like a bigger backpack. You can carry more. But that doesn't mean you can instantly find what you need. So the rest of these strategies still matter.

Chunk

Don't paste everything if you don't need everything.

If your question is about section 3, give it section 3. Not the whole document. Less noise, better signal.

Summarise

Summarise first, then ask.

If you need the model to work with a long document, ask it to summarise the document first. Then ask your real question against the summary. Two calls instead of one, but the second call has focused context. Just make sure the summary didn't leave out something important.

Position

Put the important stuff at the beginning or the end.

If you're writing a prompt that includes reference material, put your actual question at the very end. Or put the most critical context at the very beginning. Don't bury the important part in the middle.
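In code, that ordering is just a matter of how you assemble the string. A tiny sketch with a hypothetical build_prompt helper. The section labels inside the prompt are my own choice, not a required format.

```python
def build_prompt(instructions: str, reference: str, question: str) -> str:
    """Critical instructions first, bulky reference in the middle, question last."""
    return (
        f"{instructions.strip()}\n\n"
        f"Reference material:\n{reference.strip()}\n\n"
        f"Question: {question.strip()}"
    )


prompt = build_prompt(
    instructions="Answer only from the reference material.",
    reference="(long document text goes here)",
    question="What does section 7 say about cancellation?",
)
print(prompt)
```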

Restate

Restate important constraints. If you told the model something critical in message one and you're now on message fifteen, say it again. Costs you a few tokens. Saves you a wrong answer.

System prompt

Use the system prompt for persistent rules. Most platforms have a place for instructions that consistently guide the model. In ChatGPT or Claude.ai it's called custom instructions. In Amazon Bedrock it's the system prompt field. Put your stable rules there, in clear, unambiguous language. But don't assume they'll be followed perfectly forever. In long conversations, repeating critical instructions in your current message still helps.

Fresh start

Start fresh when the conversation drifts. If you've been chatting for 20 turns and the topic has shifted three times, start a new conversation. Carry over what matters. Leave behind what doesn't.

Build your own memory layer

You can summarise older turns into a compact recap, store it somewhere (a database, a file, even a simple variable), and inject that summary at the start of each new call. That's essentially a DIY cache for conversation context. You can build a version tuned to what matters for your use case.
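Here is one way that could look: keep the last few turns word for word and fold anything older into a running summary. It's a sketch, not a recommendation. The class name RollingMemory and its parameters are made up, and the summarise function is something you supply, for example a call to a small, cheap model.

```python
from typing import Callable


class RollingMemory:
    """Keep the last few turns verbatim and fold older ones into a summary."""

    def __init__(self, summarise: Callable[[str], str], keep_recent: int = 4):
        self.summarise = summarise  # caller-supplied, e.g. a call to a small model
        self.keep_recent = keep_recent
        self.summary = ""
        self.recent: list[str] = []

    def add_turn(self, role: str, text: str) -> None:
        self.recent.append(f"{role}: {text}")
        if len(self.recent) > self.keep_recent:
            # Fold the oldest turn into the running summary
            oldest = self.recent.pop(0)
            self.summary = self.summarise(f"{self.summary}\n{oldest}".strip())

    def context(self) -> str:
        """Text to inject at the start of the next model call."""
        parts = []
        if self.summary:
            parts.append(f"Summary of earlier conversation:\n{self.summary}")
        parts.extend(self.recent)
        return "\n".join(parts)
```

The trade-off is the same as with any summary: it saves tokens on every call, but whatever the summariser drops is gone for good, so critical constraints are still worth restating.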

If you're a builder, this should feel familiar. We used to put Redis in front of Postgres so not every request hit the database. Same pattern here. Some platforms offer prompt caching where the system prompt or repeated context gets processed once and reused across calls instead of being re-tokenised every time. You're not re-paying for the same static context on every request. Same instinct, different layer: cache the expensive repeated work, only send the new stuff fresh.

If you want to dig deeper into this, read about prompt caching on Amazon Bedrock.

For documents, retrieval is the answer. Instead of stuffing the entire document into the context window, you retrieve just the relevant chunks and pass those in. That's what RAG (Retrieval-Augmented Generation) does, and we'll get to it in the next post.

Same principle for both: give the model less, but give it the right less.

Key takeaways

If you're just getting started: the model has a memory limit called a context window. It applies to documents and conversations equally. Longer inputs mean thinner attention. If you're pasting something long, ask about specific sections. If you're in a long conversation, restate the important stuff. And if things start feeling off, start a new session.

If you're more on the builder side: context window size is a spec, not a guarantee. A million-token window doesn't mean a million tokens of perfect recall. Put critical information at the edges, not the middle. For conversations, implement summarisation of older turns. And start thinking about retrieval, because that's where this is heading.

What's next

So the model forgets things when you give it too much. What if there was a way to give it just the right piece, at the right time, from a document you've never even pasted in yourself?

Next post, we're going deeper into retrieval. Giving the model just the right piece at the right time.

Ride along.

This post is part of the "Learning AI Out Loud" series, a cloud architect learning AI from first principles.


Bigger AI models aren't always better. Here's how to actually choose.

Rohini Gaonkar — Fri, 15 May 2026 16:58:32 +0000

In the previous post, I showed you two models answering the same question. One hallucinated confidently. The other knew when to stop.

And a bunch of you asked: okay, but which one should I actually use?

Haiku, Sonnet, Opus. Micro, Lite, Pro. Mini, Small, Large. There are dozens of models and they all sound like perfume brands. How are you supposed to pick one?

That's this post. I'm going to take one prompt, run it through two models (one small, one large), and show you what's different. Then I'll give you a simple framework for choosing the right one.

The demo: same prompt, two models

I went back to the recipe from the first post. Same recipe. Same question. Two different model sizes.

The prompt:

"I'm cooking this for six people on Saturday. One is vegan, one is gluten-free. Adapt the recipe for me, give me a shopping list, and a timeline starting from 4pm."

The small model

Quickly, it gave me a shopping list, a timeline, and basic adaptations. Nothing fancy, but everything I asked for. If I just need a quick answer and I'm going to double-check it anyway, this works.

For a lot of everyday tasks, this is genuinely all you need.

The large model

Same prompt but a very different response.

It added a whole "Strategy for the vegan guest" section explaining why you should make a parallel pot instead of adapting the main dish. It gave me a timeline starting from the night before. Separated prep into phases. Told me to keep the rice pots separate so nothing touches the vegan side. Scaling math for going from 4 servings to 6. It even gave me an oven method AND a stovetop method as alternatives.

More thorough and more considerate. But did I need all of that for a Saturday dinner? Maybe or maybe not.

There are medium-sized models in between these two and they exist in every family. I'll tell you when to reach for them later.

But the contrast between small and large is where today's lesson lives.

Why models come in sizes

Let's take a simple example.

My son is two. If I ask him what he wants for dinner, he says "pasta." Done in two seconds without any deliberation.

If you ask me what to make for dinner, I'm thinking: what's in the fridge, what did we have yesterday, does he need more protein today, is it too late to start something that takes 40 minutes, should I batch-cook for tomorrow. Ten variables that will take me five minutes. I will give you a better answer, but slower.

Models work the same way.

A model's "size" is roughly how many parameters it has.

Think of parameters as the variables it can hold in its head when making a decision. More variables, more nuance, more ability to handle complex tasks. Fewer variables, faster and cheaper, but less sophisticated.

My son doesn't need ten variables to pick dinner. He just needs to decide. And for a lot of tasks, that's all you need from a model too. A fast answer. Not a perfect one.

Training a big model costs more. Running a big model costs more per question. And it's slower, because there are more variables to weigh for every single response.

So why not just always use the biggest one? Two reasons.

First, cost.

If you're building something that handles thousands of requests, the difference between a small model and a large model is the difference between a reasonable bill and a terrifying one.

Second, and this is the one people miss: bigger isn't always better.

For simple tasks, a big model can actually overthink it. Give you more than you asked for. Take longer to say something the small model said in two seconds.

The model families you see (Haiku/Sonnet/Opus, Micro/Lite/Pro) are just size tiers from the same provider. Same architecture, different capacity. Like buying a car in compact, sedan, or SUV. Same manufacturer. Different trade-offs. You don't take the SUV to grab milk. You don't take the compact on a cross-country road trip with three kids.

Tokens and pricing: how you actually pay

Models don't charge by the question. They charge by the token.

What's a token? It's a chunk of text. Not quite a word, not quite a letter, but roughly three-quarters of a word.

Take the sentence: "Adapt this recipe for a gluten-free vegan." Seven words but nine tokens. Some words get split, some punctuation becomes its own token.

You don't need to memorise this. Just know: token count and word count aren't the same thing. A full page of text is around 400 tokens. A million tokens is roughly a 750,000-word book.
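
If you want to play with that rule of thumb, here's a tiny sketch. It assumes the "three-quarters of a word per token" figure from above, which is only an average. A real tokenizer will give you a different number, so use this for ballpark estimates, not billing.

```python
# Rule of thumb from the text: one token is roughly 0.75 words.
# Real tokenizers vary by model, so this is an estimate only.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for a given number of English words."""
    return round(word_count / WORDS_PER_TOKEN)


print(estimate_tokens(7))        # the seven-word example sentence -> 9
print(estimate_tokens(750_000))  # the book-length example -> 1000000
```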

There's a free tool called Tiktokenizer where you can paste text and see exactly how a model breaks it into tokens. It's weirdly satisfying. Try it.

One thing that surprised me: different models tokenize the same text differently. I sent the exact same prompt and recipe to both models. The small one counted 6,548 input tokens. The large one counted 16,685. Same words, different tokenizers under the hood.

And here's the thing: you get charged twice. Once for the tokens you send in (your question). And once for the tokens the model sends back (its answer). Input tokens and output tokens. They're priced separately, and output is typically more expensive, because that's where the model is doing the work.

Real numbers

On Amazon Bedrock, for the Claude family (pricing as of May 2025, check current prices here):

Model	Size	Input (per 1M tokens)	Output (per 1M tokens)
Haiku	Small	~$1	~$5
Sonnet	Medium	~$3	~$15
Opus	Large	~$5	~$25

That's 5x more expensive from small to large. Same question, same answer, but 5x the price on both sides.

If you're asking one question yourself, who cares. The difference is fractions of a cent. But if you're building an app that handles ten thousand requests a day, each one generating a few hundred output tokens, that 5x multiplier turns into real money fast.

The best model is the model you can afford to run at the scale you need.

Where it breaks: when bigger is worse

Let's go back to the large model's response and look at the over-engineered parts. The timeline starting from the night before. "Marinate chicken in yogurt and spices, overnight is best." Fry the vegan portion first in clean oil, then fry the chicken onions in separate oil. Keep the rice pots separate. An oven method AND a stovetop method as alternatives.

The small model? A simple table. 4:00pm, start marinating. 4:05, fry onions. 5:15, into the oven. 7:00, serve. Done.

Opus is doing project management for my Saturday dinner. And here's the real cost of that overthinking.

The small model: 18 seconds, about 1,900 output tokens.
The large model: 44 seconds, 2,700 output tokens.

40% more output. 2.4x slower. And about 10x more expensive for that single request.
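
Here's the arithmetic behind that "about 10x", as a small sketch. It plugs in the approximate per-million-token prices from the table and the token counts I saw for this one request. The Pricing class and request_cost function are just names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price in dollars per one million tokens (approximate, from the table above)."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request: input and output tokens are billed separately."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Token counts are the ones reported in this post for the same prompt.
small = request_cost(Pricing(1.0, 5.0), input_tokens=6_548, output_tokens=1_900)
large = request_cost(Pricing(5.0, 25.0), input_tokens=16_685, output_tokens=2_700)

print(f"small: ${small:.4f}  large: ${large:.4f}  ratio: {large / small:.1f}x")
```

With these numbers it works out to roughly $0.016 versus $0.151 per request, a ratio of about 9.4x. At ten thousand requests a day, that is about $160 versus about $1,509 a day, for this particular prompt and these particular token counts.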

For a Saturday dinner, this is overkill. And if I were building an app that answers recipe questions for thousands of users, I'd be paying for all that extra thinking on every single request.

This is the trade-off. Bigger models are smarter, but "smarter" isn't always what you need. Sometimes you need fast, cheap, and good enough.

How to actually choose

Here's how I think about it.

First, the biggest factor: cost. We just saw a 5x difference between small and large. And that's per token. When the big model also generates 40% more tokens per response, it compounds fast. That alone narrows the field for most people. If you're building something, cost is the thing that decides what's even on the table.

If you can't afford to run it at the scale you need, it doesn't matter how good it is.

Start there. What can you actually sustain?

Then, once cost has set your boundaries, three questions help you pick within them.

1. How complex is the task?
Summarising an email? Small model. Writing a legal brief? Big model. Adapting a recipe? Probably medium.

2. How many times will you run it?
If it's one question from you personally, use whatever you want. If it's an app serving thousands of users, speed matters just as much as quality. Start small, upgrade only when the quality isn't good enough.

3. What are the stakes?
If a wrong answer ruins dinner, that's low stakes. If a wrong answer means bad financial processing logic that costs you millions, that's high stakes. Higher stakes, bigger model, plus verification on top.

That's it. Cost sets the ceiling. Complexity, volume, and stakes help you pick the floor. You don't need to memorise model names. You need to know what you're optimising for.

What about picking a provider?

I've been showing models from different providers. Claude, Nova, Llama. How do you pick a family?

Honestly? Pick the one that's available where you already work. If you're on AWS, you have access to all of them through Bedrock. If you're somewhere else, use what's there. The concepts are the same. Don't overthink the brand. Overthink the task.

One thing that confuses a lot of us early on: models and products are not the same thing. Claude is a model. But Claude inside Kiro (a coding IDE) behaves differently from Claude in the Bedrock Playground, which behaves differently from Claude on claude.ai. Same model underneath.

But each product wraps it with different instructions, tools, and context that shape how it responds. Kiro's Claude is tuned for writing code. The Playground's Claude is general-purpose. Same brain, different job description.

So when you see dozens of AI "products" out there, many of them are the same few models dressed up for different use cases. The model decides how smart it is. The product decides what it's pointed at, and is priced accordingly.

Try it yourself

If you're just getting started: models come in sizes. Bigger is smarter but slower and more expensive. For most everyday tasks, a medium model is the sweet spot. Try a few and see which one feels right for what you're doing.

If you're more on the builder side: start with the smallest model that gives acceptable quality. Only upgrade when you can point to a specific failure the bigger model fixes. Don't start big and optimise down. Start small and justify up. And remember, you can use different models for different parts of the same system. The router doesn't need to be the same size as the reasoner. The model that decides which tool to call doesn't need to be the same one that processes the result.
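
One possible way to turn "start small, justify up" into code is to try the cheapest tier first and only move up when the answer fails a check you define. This is a sketch under my own assumptions: answer_with_escalation, the tier names, and is_acceptable are all hypothetical, and each "model" here is just a function that takes a prompt and returns text.

```python
from typing import Callable, Sequence

# A "model" is any function from prompt text to answer text (assumption).
Model = Callable[[str], str]


def answer_with_escalation(
    prompt: str,
    tiers: Sequence[tuple[str, Model]],
    is_acceptable: Callable[[str], bool],
) -> tuple[str, str]:
    """Try tiers from smallest to largest; stop at the first acceptable answer.

    Returns (tier_name, answer). If no tier passes the check, the answer
    from the last (largest) tier is returned.
    """
    if not tiers:
        raise ValueError("need at least one model tier")

    name, answer = "", ""
    for name, model in tiers:
        answer = model(prompt)
        if is_acceptable(answer):
            break  # good enough, no need to pay for a bigger model
    return name, answer
```

The hard part is is_acceptable. If you can't write down what a "specific failure" looks like, you can't automate the upgrade decision either, and a manual review of sample outputs may be the better first step.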

Start small. Justify up.

What's next

Next, we're going to talk about why the model forgets what you told it. Ride along.

This post is part of the "Learning AI Out Loud" series, a cloud architect learning AI from first principles.

Follow along with the series

Thank you for featuring me! 💜

Rohini Gaonkar — Mon, 11 May 2026 16:21:23 +0000

AI in retro assembly and VR coding setups

Jess Lee for The DEV Team

May 11

Why does AI lie? Hallucinations explained simply

Rohini Gaonkar — Fri, 08 May 2026 16:37:02 +0000

In the previous post, I showed you an AI doing something genuinely useful, helping me adapt a recipe for a dinner party. We talked about the basic loop: send a prompt to a foundation model, get a response.

Today we're talking about why AI lies to you.

You know how AI sounds confident when it's completely wrong? It's called hallucination, and it's the thing that'll either make you trust AI long-term, or burn you badly.

The demo: same question, two models

I asked two different models the same question in Amazon Bedrock Playground:

"What happened at the recent Lyrids meteor shower?"

Model 1: Amazon Nova Micro 1.0

Nova Micro gave me details. Dates, locations, numbers, all delivered with complete confidence. It didn't hesitate. It didn't caveat. It just answered as if it knew.

But it doesn't know. Its training data ends in 2023. Anything after that is a gap it can't see. It didn't flag that. It just filled the gap with something plausible.

This is hallucination. The model invents something plausible to fill a gap it doesn't know how to admit. It's not lying on purpose. It's doing exactly what it's designed to do: predict what a useful-sounding answer looks like. It has no idea whether the answer is actually true.

Model 2: Claude Haiku 4.5

Same question, newer model, much more recent training.

Haiku told me straight: "I don't have access to current information. My knowledge was last updated in April 2024." Then it offered general facts about the Lyrids and suggested I check recent astronomy websites.

Progress. Newer models are better at recognising the edges of what they know.

I gave it a link to a Space.com article. It told me it can't browse the internet.

So I uploaded a PDF of that article. There are limits on file size, so I gave it only the first few pages. Then it answered accurately, pulling real details from the source.

So, in this case, I provided some context to the model and it gave me an answer based on that context.

The biography test

I asked Nova Micro:

"Tell me about Rohini Gaonkar."

It didn't hesitate. It told me I'm a "well-known Indian writer, scholar, and cultural critic." That I got my PhD in Comparative Literature from Duke. That I'm a professor at the University of Minnesota. That I've edited influential anthologies on postcolonial theory.

None of this is true. Not one detail.

The model doesn't know who I am. But it knows what an academic biography looks like. So it generated one. Complete with research interests, notable works, and recognition. All fabricated. All confident.

So Haiku knew when to stop. Nova Micro didn't.

But the underlying mechanism is the same in both models: prediction.

One has better guardrails. The other just fills every gap it finds.

Hallucination isn't just about training cutoffs. It's about the model filling gaps anywhere in what it knows. Names it hasn't seen. Niche topics. Combinations it was never taught. Better guardrails help. They don't make the problem disappear.

A note on the name test: I used my own name on purpose. If the model invents something weird about me, the only person affected is me. Be thoughtful if you try this with other people's names, especially private ones, or anyone who hasn't agreed to be part of your experiment. Whatever the model says about them, you've just generated it and potentially broadcast it. So, be cautious.

Why this happens: the architecture

Remember the loop from the last post:

Input (prompt) → Foundation Model → Output (response)

The model predicts what a useful answer looks like, based on everything it learned during training.

During training is the key phrase.

Training ends on a specific date, called the training cutoff. After that, the model is frozen. When you ask it about anything past that date, or anything it never quite learned, it has two options: say "I don't know", or do the thing it's designed to do, i.e. predict.

And for a long time, these models weren't great at saying "I don't know". That's not what they were rewarded for in training. They were rewarded for producing fluent, useful-sounding answers. So that's what they produce. Even when the answer is made up.

Hallucination shows up in different flavors: fabricated facts (the biography), outdated information stated as current (the meteor shower), inconsistent reproduction even with the source right there (the quote test). There are others too: wrong attributions, sycophantic agreement (going along with something you said even when it's wrong), and confident extrapolation (extending a pattern beyond where the data supports it).

The mechanism is always the same, prediction filling a gap, but knowing the flavor helps you design the right mitigation. We'll get into those mitigations in later posts when we talk about grounding, evaluation, and guardrails.

If you're a builder, this'll feel familiar. Think of a DNS cache. You move your app to a new server, update the DNS record, but for the next hour some users still get routed to the old IP. The cache doesn't know the record changed. It just serves what it has, confidently, because it was designed to always give you an answer fast.

Or autoscaling on the wrong metric. You scale on CPU. CPU is low, so the system thinks everything's fine. Meanwhile your queue is backed up with 10,000 unprocessed messages. The system is optimized to respond to one signal, so it confidently does nothing while things pile up.

An AI model works the same way. It was trained to always produce a helpful-sounding answer. So when it doesn't know something, it still produces a helpful-sounding answer. It doesn't have a "say nothing" instinct. It has a "say something useful-looking" instinct.

Modern models are much better at refusing. But the underlying shape of the problem doesn't go away. The model doesn't know what it knows. It just predicts.

"But ChatGPT can search the web?"

Yes, most chat tools today can look things up online. That's not the model itself doing the searching. It is a tool plugged into the model.

We'll get to how that works in a later post. For today, we're looking at the model on its own. No internet, no tools. Just what it learned.

The fix, and where the fix breaks

I gave Nova Lite the actual article as a PDF and asked it to quote the second paragraph.

It gave me a response. Then I asked the same thing again. Different answer. Same source, same conversation, two different versions of the same paragraph.

Even with the source right there, it didn't pull the paragraph verbatim. It's not retrieving. It's still predicting what that paragraph probably looks like. And prediction isn't deterministic.
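
To make "prediction isn't deterministic" a bit more concrete, here's a toy sketch. Real models pick each next token from a probability distribution over a huge vocabulary. The three-word distribution below is completely made up for illustration, and sample_next and greedy_next are my own names. The point is only the difference between always taking the top choice and sampling from the options.

```python
import random


def greedy_next(probs: dict[str, float]) -> str:
    """Always pick the most likely token: same input, same output."""
    return max(probs, key=probs.get)


def sample_next(probs: dict[str, float], rng: random.Random) -> str:
    """Pick a token at random, weighted by probability: output can vary."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented distribution for the word after "Add the ... coriander".
next_word = {"fresh": 0.5, "chopped": 0.3, "dried": 0.2}

rng = random.Random()  # unseeded, so runs differ
print("greedy :", [greedy_next(next_word) for _ in range(5)])
print("sampled:", [sample_next(next_word, rng) for _ in range(5)])
```

The greedy line prints the same word five times. The sampled line usually doesn't. Stack hundreds of these choices on top of each other and you get two different versions of the "same" paragraph.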

This matters because a lot of people think "just give the AI the document and it'll be fine."

It's better but it's not perfect. Things can get complex and messy, especially for anything that depends on exact wording, like legal text, medical dosages, or contract clauses. You still need to verify the responses.

Context reduces hallucination. It doesn't eliminate it.

Three signs you should double-check

If you're using AI day-to-day, here are the tells:

1. Specific details you can't verify. Names, dates, numbers, URLs in an area you can't check. Assume 50/50.

2. Fluency on topics that should be fuzzy. Ask about something niche or recent, get a confident detailed answer, and be suspicious. Real expertise has hedges, hallucination doesn't.

3. Citations. Especially URLs. Models invent sources that look real. If you get a URL, open it. Nine times out of ten it's fine. The tenth time it's a made-up paper.

Try it yourself

If you're more on the builder side:
Remember, hallucinations aren't a bug you patch. They're a property of the system. You mitigate them with grounding (give the model real context), with instructions (tell the model to refuse when unsure), and later, with evaluation. Designing around them is the job.
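
As a minimal sketch of the first two mitigations together, here's one way to wrap a question with real context and an explicit instruction to refuse. The function name and the exact wording of the instructions are my assumptions, not a recommended template, and this reduces hallucination without eliminating it.

```python
def build_grounded_prompt(question: str, source_text: str) -> str:
    """Combine grounding (the source) with an instruction to refuse when unsure."""
    return (
        "Answer the question using only the source between the <source> tags.\n"
        "If the source does not contain the answer, reply exactly: I don't know.\n\n"
        f"<source>\n{source_text.strip()}\n</source>\n\n"
        f"Question: {question.strip()}"
    )


prompt = build_grounded_prompt(
    "What happened at the recent Lyrids meteor shower?",
    "(paste the text of the article here)",
)
print(prompt)
```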

If you're just getting started:
Remember, AI is NOT a search engine. It's a prediction engine that's really good at sounding right. Treat specific claims the way you'd treat a confident stranger at a party. Friendly, but verify before you repeat them.

Some examples I found on the internet, for fun and educational purposes only (answers may change as models catch up):

How many 'r's are in the word strawberry?
If I have to take my car to the car wash, and the car wash is 100 ft away, should I drive or walk?

What's next

Why are there so many of these things? Haiku, Sonnet, Opus. Mini, large, pro. And honestly, which one should you actually pick?

That's the next post. Ride along.

This post is part of the "Learning AI Out Loud" series, a cloud architect learning AI from first principles.

Follow along with the series

What Even Is AI? (I Took a Break & Had to Relearn Everything)

Rohini Gaonkar — Tue, 05 May 2026 18:38:06 +0000

I just came back from maternity leave. And honestly? I felt like I'd missed a decade in six months. I talked about starting small in my other post, "Lost in the AI Hype, I Started Small".

I've spent the last fifteen years designing cloud systems. And even I felt behind. AI went from a thing people were experimenting with to a thing everyone's apparently building with, and I had no idea where to start.

So I did what any architect would do. I went back to first principles.

I'm rebuilding my AI mental model from scratch in public. No math. No expert-level coding. Just real problems, the architecture underneath, and honest notes on where things might break.

If you prefer video, please watch Episode 1 of my video series. If you prefer reading, you're in the right place.

The demo: AI adapts a recipe in under a minute

Before any theory, let me show you what these models can actually do.

I opened Amazon Bedrock Playground, pasted a real recipe, and asked three questions with each one pushing the model a little further:

1. Extract and summarise

"What are the core techniques in this recipe, strip off the fluff?"

Clean, fast, useful. You might think: that's a fancy Ctrl+F (search).

2. Interpret and advise

"Looking at this recipe, what's the thing that's most likely to go wrong for someone cooking it for the first time?"

Now we're somewhere a search tool genuinely can't go. The model is reasoning about the recipe, spotting the bit where people actually mess up.

3. Personalise

"I'm cooking this for six people on Saturday. One is vegan, one is gluten-free. Adapt the recipe, give me a shopping list, and a timeline starting from 4pm."

This is the moment. I asked it something I'd normally spend twenty minutes thinking through. It gave me a starting point in ten seconds.

If you're curious but not technical, that's already useful.

If you're a builder, you're probably already wondering what happened here.

So what actually happened?

Here's the architecture, as simply as I can put it.

I sent text called a prompt to a foundation model.

People throw around terms like AI, LLMs, and foundation models like they all mean the same thing but they don’t.

AI is the broad umbrella. It includes everything from recommendation engines and fraud detection systems to generative AI tools like ChatGPT.

Foundation models are a subset of AI: large models trained on massive datasets that can be adapted for different tasks. These aren’t just text models; they can generate images, video, speech, code, and more. Platforms like Amazon Bedrock give access to many of these models.

LLMs (Large Language Models) are a specific type of foundation model built for language tasks like answering questions, summarizing text, writing, or coding. So in my recipe demo, I was technically interacting with an LLM.

The simplest way to think about it:

AI → Foundation Models → LLMs

So in our case, it means it's a big model trained on a huge mix of data for day-to-day, general-purpose use.

The model is a piece of software trained on an enormous amount of text: books, articles, code, conversations. It is not searching the internet. It learned patterns from all that text beforehand.

When I give it my prompt, it predicts the most useful response based on everything it learned.

Input (prompt) → Foundation Model → Output (response)

I've been building distributed systems for years, and a foundation model call is simpler than most of the APIs I'm used to. It's an HTTP request with text in, text out.

The complexity isn't in the call itself, it is in what the model learned before you or I ever showed up.

And this exact loop is what the entire current wave of AI is built on.

Every time you see a new Claude, or GPT, or Llama land, what's actually happening is someone trained a bigger or smarter version of this same idea.

Same loop. More data. Better prediction.
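
For the builders, here's roughly what that text-in, text-out loop looks like from code. This is a sketch that assumes you have boto3 installed, AWS credentials configured, and access to a model in Amazon Bedrock. It uses the Bedrock Converse API. The model ID and the region are placeholders you'd replace with your own, and the helper names are mine.

```python
def build_converse_request(prompt: str, model_id: str, max_tokens: int = 512) -> dict:
    """Build the keyword arguments for a single-turn Converse call."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }


def ask(prompt: str, model_id: str, region: str = "us-east-1") -> str:
    """Send one prompt to a Bedrock model and return the text of its reply."""
    import boto3  # imported here so the request builder works without boto3

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.converse(**build_converse_request(prompt, model_id))
    return response["output"]["message"]["content"][0]["text"]


# Example (needs AWS credentials and model access; the model ID is a placeholder):
# print(ask("What are the core techniques in this recipe?", "YOUR_MODEL_ID"))
```

Everything interesting is in the prompt you send and the text you get back. The call itself is a few lines.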

Where it breaks

The model doesn't know if it's right. It's predicting what a useful answer looks like. Sometimes that prediction is brilliant. Sometimes it invents something that sounds plausible and is completely wrong.

Every time you use one of these tools, ask yourself: what would I need to double-check before I trusted this?

That question is the single most useful habit you can build right now. We'll dig into why this happens in the next post.

Where the models live: Amazon Bedrock

You might've noticed I wasn't using ChatGPT or Claude's own website. I was using Amazon Bedrock.

Bedrock is where a bunch of foundation models live on AWS. Anthropic's Claude, Meta's Llama, Mistral, Amazon's own models: they're all callable through Bedrock, with no need to run or train anything yourself.

The Playground is the easy door in, just type and go. Later in this series, when we start building, we'll call these same models from code. Same models, different door.

A note on my stack

I work at AWS. So the tools I use in this series are AWS tools like Bedrock for the models, and later, an AI-powered IDE called Kiro for building.

The concepts, though, aren't AWS-specific. Foundation models, tokens, context windows, RAG, agents: these work the same way on any cloud. I'm showing you my stack. And honestly, I'm still figuring out which parts of it are great and which parts are a pain. You'll know which is which.

Try it yourself

If you're just getting started: open any AI chat tool (Bedrock Playground, Claude, ChatGPT, whatever you have access to), paste a recipe, a contract, or a long email, and ask it three questions:

One to summarise.
One to interpret.
One that's personal to you.

See what happens. That's your homework.

If you're more on the builder side: the mental model is simple: text in, model, text out. Everything we build in this series is a variation on that loop.

What's next

Next up: when AI sounds confident and is completely wrong. Why it happens, how to spot it, how to stop it.

This is a series. I'm learning this in public, building as I go, and being honest when things don't work. If that sounds useful, please follow along.

This post is part of the "Learning AI Out Loud" series, a cloud architect learning AI from first principles. Watch the video version or follow the series on dev.to.

Stop waiting to feel ready. You won’t. Not after a break. Not after life changes. Not when everything feels like it’s moving faster than you. I almost didn’t publish this.

Rohini Gaonkar — Wed, 15 Apr 2026 14:49:18 +0000

Lost in the AI Hype, I Started Small

Rohini Gaonkar — Wed, 15 Apr 2026 14:32:01 +0000

And it helped me get back into tech without drowning

TL;DR at the end

Coming back to tech after a (maternity) break is a strange feeling.

You’re excited but also unsure where to begin. There are new tools, new terminology, and new ways of doing things we’ve done for decades.

But I didn’t try to figure everything out in one go. I just picked one small thing.

For me, that “one small thing” was finally building my portfolio collection.

Over the years, my content (blog posts, YouTube videos, conference talks, GitHub repos, social posts) has been scattered across a dozen different platforms: DEV.to, community.aws, YouTube, GitHub (two accounts!), Instagram, LinkedIn, SlideShare... you name it. 🙇‍♀️

More than 80 pieces of content, scattered across platforms since 2015!!!

And honestly? Maintaining my existing site rohinigaonkar.com felt harder than starting from scratch.

I wanted something simpler, a lightweight site I can update by editing a single file, push to GitHub, and it's live. Easy to navigate, easy to maintain. No fluff.

I built this as my first project back from maternity leave, and I did it with Kiro, an AI-powered IDE from AWS that I'd never used before. Two firsts at once. It turned out to be the perfect re-entry project.

Building It with Kiro: My First Impressions

This portfolio had been on my mental to-do list forever, so the timing felt right. And rather than spinning up a complex stack to shake off the rust, I decided to keep it simple and lean on an AI coding assistant to help me get back into the flow.

Here's what stood out about the experience, starting with the simplest features and building up.

The last one might surprise you!!! 🤯

1. Chat-Driven Development

The entire project was built through conversation.

I described what I wanted, "I have this website where I collect my content shared across multiple social media websites. it is one true place where any tech content I posted on the web be it dev.to, github, youtube, instagram, and any aws first party channels, all of this to be collected as a timeline. can we build something that can be refreshed on demand and build this portfolio. make it professional looking. ask more intelligent questions as we go.", and Kiro asked clarifying questions even before writing a single line of code.

It asked about static vs dynamic, hosting preferences, design vibe, and data source approach. It even asked me about my identity. That back-and-forth shaped the architecture before any code was generated.

Once I had answered all the questions, it helped build the initial structure and walked me through it.

Notice how after every conversation, it shows how many credits each prompt consumes, in real-time. That was nice!

There is a spec-driven development mode as well, which I'll be testing on something more complex than this static website.

2. Web Search

Kiro searched the web to find my existing content across platforms. It looked up my dev.to profile, GitHub repos, community.aws presence, YouTube channel, and even my current website at rohinigaonkar.com. This gave it real context about who I am and what content already exists, so the portfolio wasn't built with placeholder data, it was seeded with my actual content from day one.

3. Explore Real-Time File Changes

As Kiro generated and edited the files, I could see every change happening in real time through the explorer. Either click the "Follow" option or click on the little "diff button" highlighted with a yellow square below.

For example, when it created index.html or content.js, I could immediately open them, review the code, and see the diffs. When it later modified content.js to add YouTube videos or reclassify talks vs videos, I could see exactly what changed and why.

4. Trusting Frequently Run Commands

Kiro ran shell commands like curl to hit APIs and extract data. It asked for my permission to either run a command once or add it to a trusted list of commands.

I also liked that it offered levels of trust: I could trust just this exact command, a partial match, or the base command.

In autopilot mode, I could trust these commands to execute without approving each one individually. This was especially useful during the YouTube oEmbed batch processing, where Kiro ran 16 consecutive curl commands to fetch video titles. Approving each one manually would have been tedious.

5. Iterative Refinement Through Conversation

The project evolved through multiple rounds of feedback:

I pointed out that "talks" should only mean conference/meetup presentations, not YouTube tutorial videos - Kiro reclassified everything accordingly
I noted that some AWS "talks" were actually just YouTube embeds on my website - Kiro dug into the pages, extracted the real YouTube URLs, and recategorized them as videos
I shared my personal GitHub profile separately from my work one - Kiro pulled repos from both and updated the refresh script to handle multiple accounts

Each round of feedback made the portfolio more accurate without starting over or deleting some other important information.

6. The YouTube Challenge: Hitting Walls and Finding Workarounds

This was the most interesting part as YouTube is heavily locked down:

Direct fetch failed - webFetch on youtube.com returned empty content
Rendered mode failed - returned only JavaScript bootstrap code, no actual page content
Search was noisy - web searches for my videos returned generic results, not my specific content
RSS feeds blocked - YouTube's channel RSS wasn't accessible either

But Kiro didn't give up!!! 💜

It found workarounds on its own:

oEmbed API - Kiro discovered that YouTube's oEmbed endpoint (youtube.com/oembed?url=...) returns video titles as JSON, and used curl to call it directly. This became the reliable method for all 16 videos I shared.
Squarespace page parsing - For videos embedded on my website, Kiro parsed the raw HTML to extract YouTube video IDs from Squarespace's embed block JSON (double HTML-unescaping the content to find URLs like youtube.com%2Fembed%2Fi0zQpJPfSdU).
Thumbnail URL extraction - It even tried extracting video IDs from ytimg.com/vi/VIDEO_ID/hqdefault.jpg thumbnail patterns as a fallback.
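
I don't have the exact commands Kiro ran, so here's my own rough reconstruction of the workarounds above in Python. It builds the oEmbed URL for a video, fetches its title, and pulls video IDs out of escaped embed HTML or thumbnail URLs. The function names are mine, and the regular expressions assume YouTube video IDs are 11 characters of letters, digits, "-" and "_".

```python
import html
import json
import re
from urllib.parse import quote, unquote
from urllib.request import urlopen

OEMBED_ENDPOINT = "https://www.youtube.com/oembed"

# Assumption: video IDs are 11 characters from this character set.
EMBED_ID = re.compile(r"youtube\.com/embed/([A-Za-z0-9_-]{11})")
THUMB_ID = re.compile(r"ytimg\.com/vi/([A-Za-z0-9_-]{11})/")


def oembed_url(video_url: str) -> str:
    """Build the oEmbed lookup URL for one video page URL."""
    return f"{OEMBED_ENDPOINT}?url={quote(video_url, safe='')}&format=json"


def fetch_title(video_url: str, timeout: float = 10.0) -> str:
    """Call the oEmbed endpoint and return the video's title (needs network)."""
    with urlopen(oembed_url(video_url), timeout=timeout) as response:
        return json.load(response)["title"]


def extract_video_ids(raw_html: str) -> list[str]:
    """Find video IDs in embed or thumbnail URLs, even when doubly escaped."""
    # Undo two layers of HTML escaping, then percent-encoding (%2F -> /).
    text = unquote(html.unescape(html.unescape(raw_html)))
    ids = EMBED_ID.findall(text) + THUMB_ID.findall(text)
    return list(dict.fromkeys(ids))  # de-duplicate, keep first-seen order


print(oembed_url("https://www.youtube.com/watch?v=i0zQpJPfSdU"))
```

Calling fetch_title on each video URL is the Python equivalent of the batch of curl calls. The ID extraction is the fallback for pages where you only have embed markup to work with.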

The iterative problem-solving here (trying one approach, hitting a wall, pivoting to another) felt very much like how a developer would debug a scraping problem.

I loved how Kiro told me what it had tried, that it had failed, and that it was going to try something else. We also worked together on a process that was a good compromise for both of us. Maybe in the future I will have an agent to simplify this, but for now this serves my purpose!

The Result

A live static portfolio at rohinigaonkar.github.io with ~80 entries spanning 2015–2025, filterable by type (blogs, videos, repos, talks, social), searchable, and refreshable on demand via a Node.js script that pulls from GitHub and dev.to APIs. All built through conversation in a single Kiro session.
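To make the "refreshable on demand" part concrete, here is a small sketch of what a refresh step could look like. The real script is Node.js; this is Python, and the entry shape (type, title, url, date) and all function names are my own assumptions for illustration. The input fields used (name, html_url, created_at for GitHub repos; title, url, published_at for dev.to articles) are the ones those public APIs return.

```python
def from_github(repo):
    """Map one GitHub repo object to a portfolio entry (hypothetical entry shape)."""
    return {"type": "repos", "title": repo["name"],
            "url": repo["html_url"], "date": repo["created_at"][:10]}


def from_devto(article):
    """Map one dev.to article object to a portfolio entry."""
    return {"type": "blogs", "title": article["title"],
            "url": article["url"], "date": article["published_at"][:10]}


def merge_entries(existing, fetched):
    """Add fetched entries without duplicating or overwriting existing ones."""
    by_url = {entry["url"]: entry for entry in existing}
    for entry in fetched:
        by_url.setdefault(entry["url"], entry)  # manual edits win over API data
    # ISO dates sort correctly as strings; newest first.
    return sorted(by_url.values(), key=lambda e: e["date"], reverse=True)


def filter_entries(entries, kind=None, query=""):
    """Filter by type and by a case-insensitive title search."""
    return [e for e in entries
            if (kind is None or e["type"] == kind)
            and query.lower() in e["title"].lower()]
```

Keying on the URL means the script can be re-run safely, and repos from more than one GitHub account can simply be concatenated into the fetched list before merging.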

What's Next

I'll keep adding content as I publish it. The refresh script makes the API-sourced stuff automatic, and the manual entries take about 30 seconds each. I might add dark mode at some point, and maybe an RSS feed. But right now, the simplicity is the feature.

More importantly, this project reminded me that coming back doesn't have to be intimidating.

If you're returning from a break and looking for a low-pressure way to get back into coding, I'd recommend picking a passion project and giving Kiro a try. You might surprise yourself with how quickly you get back into the zone.

And follow along as I explore this world of AI.

TL;DR:
Returning to tech after maternity leave felt overwhelming, so I started small by building a portfolio site to consolidate 80+ pieces of content scattered across platforms since 2015. Using AWS Kiro (an AI-powered IDE) for the first time, I built the entire project through conversation—no complex setup needed. Kiro helped with web searches, real-time file changes, iterative refinements, and creative problem-solving (especially when YouTube's APIs were locked down). The result: a live, searchable portfolio at rohinigaonkar.github.io that's easy to maintain. The lesson? Coming back doesn't have to be intimidating—pick one small passion project and let AI tools help you get back into the zone.

Get into Cloud computing with no experience

Rohini Gaonkar — Wed, 08 Feb 2023 00:30:00 +0000

I have been working in the IT industry for almost 14 years now, in multiple roles, across the world, and luckily all of this experience has been with Cloud Computing technologies. Over the years hundreds of people have asked me the same question - how do I get into Cloud-based roles?

A lot of folks want to get into Cloud Computing roles BUT companies want to see experience on your CV/resume. It is a dilemma for a lot of folks.

If you don't get hired into Cloud Computing projects, you can't get experience, and without experience companies won't hire you.

So I thought I should talk about this in detail and help folks who want to make the switch to Cloud Computing, because Cloud is the new normal now and it should have a place on your resume.

Watch this short YouTube video on how you can build your skills, your resume and your brand to apply for jobs.

I provide subtitles and chapters, so feel free to jump to the content you like.

Let me know in the comments section if you have any questions or any topics I should elaborate on. I am happy to guide folks so we are all successful in our own lives! I hope this gives you a fair idea of where to start and motivates you to take your first steps into a successful career!

How to ‘Be your own cheerleader’

Rohini Gaonkar — Fri, 06 Jan 2023 17:12:08 +0000

Let me start by asking you a question -

do you feel unskilled at advocating for yourself and unsure of how to be proud of your work and accomplishments?

If you answered yes, then this blog post is for you.

Over the years, I have seen a common thread while mentoring tech community members – they do the hard work but they have poor self-esteem and self-worth. They do some amazing work, they have positively impacted so many lives, however, they don’t know how to showcase their work.

We humans tend to get demotivated over time and lose confidence in our own abilities. This impacts relationships with co-workers and even job interviews. Now some might not agree with me, but it is a harsh reality of today’s world: if you do not talk about your body of work, who will?

If you have attended my talks, then you would have heard me say this hundreds of times – BE YOUR OWN CHEERLEADER...!

If you do 99 things right and 1 mistake, which one do you obsess over?
Well, we typically obsess over that 1 mistake, because we are hard-wired to look at that one negative comment. Our memories are fickle; we tend to remember things that have stronger emotions associated with them. So how do you ensure that our minds do not obsess only over the negative, but also see the positives?

It’s simple, WRITE IT DOWN!

Trust me on this, I have done this activity with many of my mentees and it works EVERY. SINGLE. TIME. Identify the broad categories of your body of work – projects, volunteering, public speaking, blogging, social media, soft skills, certifications etc. Once you have identified the broad categories, list down all the achievements you can think of in these categories. It doesn’t matter if it is significant or not. This is YOUR list, and you don’t have to be ashamed of anything.

Are you usually quiet and you spoke up in your team meeting? Write it down! Pat yourself on the back for overcoming the shyness and the fear. No one is going to judge you.

Once you write down the list, start adding meat to it.

For example, this is an actual excerpt from one of my mentees in the Public Speaking category:

Conducted sessions for college students on zoom for Universities like X Y Z to spread awareness on Cloud, Helped students to be Market ready and mentor them for their innovative projects.

Now, if you are a recruiter or a manager who is building a case for a promotion, what would be your first reaction? Do you get any idea of the impact of the work here? No.

What about this..

Regularly conducts virtual sessions for college students for major universities like X Y Z, to spread awareness on Cloud. She continues to mentor students on their innovative projects, career guidance and be market ready.

Great, so now with some basic language cleanup I have made it concise and a bit more professional.

However, we can do better. At this point, ask yourself a question – SO WHAT?

So, what should I do about this information? Why does it matter? Can I quantify this impact? Can I put it in perspective?

Regularly conducts virtual sessions for college students for major universities like X Y Z, to spread awareness on Cloud. The 15 sessions were attended by 2000+ students with avg CSAT 4.7/5. She continues to mentor students on their innovative projects, career guidance and be market ready.

Isn’t this better? You know what she does regularly, how many sessions she ran, how many students attended, and that the CSAT (customer satisfaction score) is impressive. Providing examples with metrics or data is always more impactful than using fancy words or long paragraphs.

You can make it even more impressive by adding anecdotes or testimonials from attendees, or by mentioning whether this led to further engagement, for example the universities signing her up as a regular guest lecturer.

Make a collection

Create a folder on your computer for ‘appreciations’. Anytime someone says something nice about you, take a screenshot and store it for yourself. If you can, ask people to give you testimonials or LinkedIn recommendations. On days when you do not feel good, open this folder and read through it. You will get an instant motivation boost.

Make it a habit

I suggest updating your list as soon as each activity is done.

You can also create monthly, quarterly, half-yearly and yearly snapshots of these activities.

If you are looking for a job change, make this list right now!

It will not only help you look at your accomplishments, it will also help you show your interest and impact. You can add them to your resume or portfolio. During an interview, you can use this list to showcase impact and highlight your success stories.

You can use it to build great examples for behavioral interview questions in the STAR method (Situation, Task, Action, and Result). Specifics are key; avoid generalizations. Give a detailed account of one situation for each question you answer, and use data or metrics to support your example.

Once you start building these writing skills, the next time before you say yes to an activity, you will automatically assess whether the activity has true measurable impact and helps you achieve your goals.

If you do not have the option of saying no, at least you will assess how this activity impacts you in the greater scheme of things.

Build your support system

Humans are social animals. While you are doing this writing activity by yourself, you need your own support system. Find people in the community, at your workplace, in your friend circle who support you and cheer you on for these accomplishments. Do not be arrogant or bullyish about your work. Appreciate people who supported you in the journey. You can summarise and publish this on your social media. Social media can be a great ego booster so why not use it for that? You should celebrate your accomplishments! Be your own cheerleader and the world will follow you!

Contrary to common belief, at the workplace your manager should be part of your support system.

If you think your manager should inherently know all the awesome work you do, then you are setting yourself up for disappointment. Managers are not responsible for keeping track of your achievements. When you assume your manager owns your growth, it inevitably creates frustration for both parties. There's only so much each of you can do without the other.

As mentioned before, our own memories are fickle, so how can we expect a human being (aka your manager) with many direct reportees to remember all your good work? Be a good direct reportee and make your manager’s job a little easier. Regularly share your work progress and achievements in a short, concise manner as discussed above. It makes it easier for them to remember and share with other internal parties.

So that’s it. Today we learnt how you can build your own list of achievements, make them concise and data-oriented, and share them with your support system. I hope some of these tips help you build your career and you continue to climb the ladder of success!

Builders Guide to AWS Summit Online India 2022

Rohini Gaonkar — Mon, 23 May 2022 08:11:44 +0000

For many years now, I have had the privilege to attend, speak at and even curate the agenda for AWS Summits, the free and open tech conferences that happen globally. This year I have been more closely involved and it gives me immense happiness to showcase what we have in store for you. This is a great conference to meet and network with like-minded enthusiasts about the technology we all love.

Event highlights

Just like the last 2 years, the AWS Summit Online India 2022 is a 2-day virtual technical conference, on Wednesday, 25th May and Thursday, 26th May, with opening keynotes starting from 9am IST.
And this is amazing as anyone from anywhere across the world can attend sessions and build their network. Registration for the event is free, so go ahead and sign up on the event website.

The Summit will offer 150+ educational sessions, technical demonstrations, and keynotes featuring the VP of AI at AWS, Dr. Matt Wood, with other AWS business leaders and AWS customers like Tally, ICICI Lombard, Apollo Tyres and more.

Our AWS Community Hero for DevTools, Bhuvaneswari Subramani, is presenting in the Day 1 Keynote.

The sessions are divided into 17 tracks with themes like AI/ML, Big Data, Cloud Security and Compute Anywhere, along with 5 experiential zones, including exam readiness sessions in Training & Certification, deep dive demos in Builders Fair, and Startup Central. Pick and join any sessions you wish based on the detailed Agenda on the event website.

Attend the Summit to know more about what is happening in the Cloud Computing space and how India is using AWS to build innovative applications. Most of the sessions have AWS customers across India talking about how they have been using AWS in their real-world solutions.

Sessions are labelled from Level 100 (introductory) to Level 400 (expert), so pick sessions that match your level of knowledge.

Community track

I am happy to introduce you to the 'Build on AWS' track where, for the first time ever, our community members are presenting at the AWS Summit India, alongside me. We will present advanced deep dive sessions on topics like Containers, Serverless, DevOps, Data Lakes and more.

Architecting for sustainability Level: 300 | Advanced

I will be presenting this session, where I will dive deep into techniques recommended by the AWS Well-Architected Framework Sustainability pillar and provide direction on reducing the energy and carbon impact of AWS architectures. Learn about best practices, which organisations of any size can apply to their workloads and how to review the new Customer Carbon Footprint Tool report with a demo.
Connect with me on LinkedIn

End-to-End CI/CD at scale with infrastructure-as-code on AWS Level: 300 | Advanced

AWS DevTools Hero - Bhuvaneswari Subramani, dives deep into building a production-ready, multi-account, at-scale CI/CD pipeline using your own Jenkins, with infrastructure-as-code using AWS CloudFormation, and discusses best practices for building DevOps capabilities for your container applications running on AWS.

Just-in-time worker nodes for Amazon Elastic Kubernetes Service using Karpenter Level: 300 | Advanced

AWS Container Hero - Dijeesh Padinharethil, will demonstrate how Karpenter simplifies Kubernetes infrastructure with the right nodes at the right time. Karpenter is an open-source, flexible, high-performance Kubernetes cluster autoscaler built with AWS.

Build serverless apps with SAM Accelerate and SAM Pipelines Level: 300 | Advanced

AWS Community Builder - Jones Zachariah Noel N, demonstrates how to use SAM templates to manage serverless infrastructure as code. He will share best practices for using the AWS SAM CLI and the recently announced AWS SAM Accelerate to develop and debug serverless applications on your local machine. He will also showcase the ease of CI/CD workflows with SAM pipelines to multiple staging environments.

Build performant, scalable and secure GraphQL APIs Level: 300 | Advanced

AWS Community Hero - Dipali K, and AWS UG Delhi leader - Rajat Arora, take you through security best practices for your GraphQL APIs with AWS AppSync, and Amazon Cognito for easy management, improved performance, and observability.

Build sustainable applications on AWS using Rust Level: 200 | Intermediate

AWS Solutions Architect Developer Specialist - Sundararajan Narasiman, will dive into the “super powers” of Rust. Hear about the work ahead to give those powers to every engineer, and learn about the ways in which you can contribute. He will cover how to build a Lambda function using Rust and deploy it to AWS.

Build a data lake with AWS Lake Formation Level: 200 | Intermediate

AWS Community Builder - Sanchit Jain, will explore data lake challenges and how AWS Lake Formation can help you. If you're a developer, DBA, or a data engineer who works with data, this session is for you.

Build a real time pipeline to ingest streaming data Level: 300 | Advanced

AWS Data Hero - Sridevi Murugayen, will look at how you can build streaming data analytics pipelines that speed up time to information from hours to seconds. She will discuss how streaming-data services like Amazon Kinesis, together with AWS Lambda, are used to capture and analyse data in real time from hundreds of sources, store it in Amazon S3, then process it with Amazon EMR and deliver it to your Amazon Redshift data warehouse.

Goodies up for grabs

I would recommend watching a few noteworthy sessions live. The live experience will give you an opportunity to connect with peers and tech enthusiasts, ask questions to experts, and last but not least, win swag.

You are eligible for the certificate of attendance as long as you watch 5 or more sessions during the conference, and you stand a chance to win $25 AWS credits too. Also, watch 2 or more express trainings in the Training & Certification Zone and be amongst the first 2000 to get the discount certification voucher!

AWS Summit India event website - https://aws.amazon.com/events/summits/india/

