I Let My AI Agent Build a Bedrock RAG Knowledge Base: Here Are the 2 Mistakes the AWS Agent Toolkit Caught

#aws #ai #agentskills #bedrock

Provisioning a Bedrock RAG knowledge base with S3 Vectors, without the hallucinated API calls.

If you've asked an AI coding agent to set up AWS, you've seen it confidently invent a parameter, reach for a deprecated service, or burn ten minutes retrying against a service it never saw in training. The failure mode that bites hardest is the silent one: the agent thinks it succeeded, and you find out an hour later.

I hit two of these while standing up the retrieval layer for a LangGraph support bot: an Amazon Bedrock Knowledge Base backed by Amazon S3 Vectors. I'd love to say I caught both with deep AWS expertise. I caught them because the Agent Toolkit for AWS read the docs I hadn't. Both would have shipped, and neither did.

The 30-second setup

The goal: take a folder of markdown product docs and make them queryable by meaning, so an agent can answer "is this safe for color-treated hair?" from the real docs instead of guessing. Think of it as giving the agent a library it can search instead of making things up. That's the retrieval half of RAG, the foundation a LangGraph agent will later call as a tool.

Four moving parts, wrapped in one managed service:

Source bucket: an S3 bucket holding the docs.
Embeddings: Amazon Titan Text Embeddings V2 (1024-dim vectors).
Vector store: Amazon S3 Vectors. I chose it over OpenSearch Serverless because it has no always-on compute. For a demo that sits idle, that's the difference between cents and a monthly surprise.
Knowledge Base: Amazon Bedrock Knowledge Bases ties it together into one thing you can query with a retrieve call.

To follow along, you need an AWS account, a non-root IAM identity with credentials configured locally, uv installed, and the toolkit installed in your agent. The fastest path across Kiro, Claude Code, Cursor, and Codex is the AWS CLI installer, aws configure agent-toolkit; in Kiro you can instead add the AWS MCP Server to .kiro/settings/mcp.json (pin the mcp-proxy-for-aws version) and run npx skills add aws/agent-toolkit-for-aws/skills. The toolkit plugs into the agent you already use and loads task-specific skills on demand; I used the amazon-bedrock skill, which carries the validated, current procedure for building a Knowledge Base. That word, "current," is the whole story.

Gotcha #1: the model id was already dead

My first instinct, straight from an older tutorial, was anthropic.claude-3-5-sonnet-20240620-v1:0. Calling it returned:

ResourceNotFoundException: This model version has reached the end of its life.

The fix the toolkit's doc search surfaced: newer Anthropic models on Bedrock are served through inference profiles rather than bare on-demand model ids. You invoke them with a cross-region profile id like us.anthropic.claude-sonnet-4-5-20250929-v1:0 (note the us. prefix), not the same id without the prefix.

On its own, an agent might not even diagnose this correctly. "Not found" reads like a permissions or region problem, so it could swap in another stale id and hit "on-demand throughput isn't supported" instead, flailing sideways. The toolkit got it right because it read the current model docs, not because it happened to remember them.
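To make the id difference concrete, here's a minimal Python sketch. The helper name and the list of geography prefixes are my own assumptions for illustration. Whether a profile actually exists for a given model in your Region is something to confirm in the Bedrock docs or console, not something to derive from a string.

```python
# Hypothetical helper: derive a cross-region inference profile id from a bare model id.
# Assumption: system-defined profiles are named "<geo>.<model id>", like the id above.
KNOWN_GEO_PREFIXES = ("us", "eu", "apac")  # illustrative, not an exhaustive list


def to_inference_profile_id(model_id: str, geo: str = "us") -> str:
    """Return a profile id such as 'us.anthropic...' for a bare 'anthropic...' id."""
    if geo not in KNOWN_GEO_PREFIXES:
        raise ValueError(f"unknown geography prefix: {geo!r}")
    # Leave the id alone if it already carries a geography prefix.
    if model_id.split(".", 1)[0] in KNOWN_GEO_PREFIXES:
        return model_id
    return f"{geo}.{model_id}"


bare_id = "anthropic.claude-sonnet-4-5-20250929-v1:0"
profile_id = to_inference_profile_id(bare_id)
print(profile_id)
```

The point isn't the string manipulation. It's that the id you pass is a different kind of resource, which is why swapping one stale model id for another never fixes the error.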

Gotcha #2: Bedrock won't create the S3 Vectors index for you

I created the vector bucket, pointed the Knowledge Base at an index name, and assumed Bedrock would create the index. It didn't:

ValidationException: The specified index could not be found (S3Vectors 404)

The real requirement, from the S3 Vectors docs: you create the index yourself, and it must declare two non-filterable metadata keys that Bedrock uses to store chunk text and metadata. Miss them and ingestion fails later with a cryptic error far from the cause. The working command:

aws s3vectors create-index \
  --vector-bucket-name <VECTOR_BUCKET> \
  --index-name <INDEX_NAME> \
  --data-type float32 --dimension 1024 --distance-metric cosine \
  --metadata-configuration '{"nonFilterableMetadataKeys":["AMAZON_BEDROCK_TEXT","AMAZON_BEDROCK_METADATA"]}' \
  --region us-east-2

This is the one that best captures why current docs matter. S3 Vectors launched in 2025, so the requirement isn't in most models' training data. A toolkit-less agent would most likely create the index without those metadata keys, think it succeeded, and only hit the wall at ingestion time, then burn an afternoon recreating it with the wrong config. The dimension (1024) isn't arbitrary either: it has to match the output size of the Titan embedding model, and the distance metric has to be chosen up front too. That's the kind of cross-resource constraint an agent gets wrong when it's guessing.
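One way to keep that constraint from drifting is to generate the command instead of typing it. The sketch below is not part of the toolkit. The function name is hypothetical, the bucket and index names are placeholders, and the set of allowed dimensions is my understanding of what Titan Text Embeddings V2 can emit. It only rebuilds the same flags as the command above and refuses a dimension the embedding model can't produce.

```python
import json
import shlex

# Assumption: output sizes Titan Text Embeddings V2 can be configured to emit.
TITAN_V2_DIMENSIONS = {256, 512, 1024}
# Keys copied from the create-index command above.
BEDROCK_NON_FILTERABLE_KEYS = ["AMAZON_BEDROCK_TEXT", "AMAZON_BEDROCK_METADATA"]


def create_index_command(vector_bucket, index_name, dimension, region, metric="cosine"):
    """Build the `aws s3vectors create-index` argument list, rejecting mismatched sizes."""
    if dimension not in TITAN_V2_DIMENSIONS:
        raise ValueError(f"{dimension} is not a Titan Text Embeddings V2 output size")
    if metric not in {"cosine", "euclidean"}:
        raise ValueError(f"unsupported distance metric: {metric!r}")
    metadata = {"nonFilterableMetadataKeys": BEDROCK_NON_FILTERABLE_KEYS}
    return [
        "aws", "s3vectors", "create-index",
        "--vector-bucket-name", vector_bucket,
        "--index-name", index_name,
        "--data-type", "float32",
        "--dimension", str(dimension),
        "--distance-metric", metric,
        "--metadata-configuration", json.dumps(metadata),
        "--region", region,
    ]


# Placeholder names; substitute your own bucket and index.
command = create_index_command("my-vector-bucket", "my-index", 1024, "us-east-2")
print(shlex.join(command))
```

Printing the command rather than running it keeps the sketch side-effect free; you can paste the output into a shell once it looks right.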

The rest fell into place, and it works

With those two out of the way, the validated sequence ran clean. First, create the IAM service role: it trusts bedrock.amazonaws.com with confused-deputy conditions, so a Knowledge Base in someone else's account can't get Bedrock to assume the role, and it carries least-privilege permissions to invoke Titan, read the bucket, and use the vector index. Then create the Knowledge Base, attach the S3 data source with fixed-size chunking (300 tokens, 20% overlap), and run ingestion. Result: 10/10 documents indexed, zero failures.
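For the chunking settings, this is roughly the shape the data source takes in its vectorIngestionConfiguration field, as I understand the CreateDataSource API. The function name and the validation bounds are my own; treat it as a sketch and check the field names against the current API reference before relying on it.

```python
import json


def fixed_size_chunking(max_tokens: int, overlap_percentage: int) -> dict:
    """Build a fixed-size chunking configuration for a Knowledge Base data source."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Assumption: the overlap is a whole percentage strictly between 0 and 100.
    if not 1 <= overlap_percentage <= 99:
        raise ValueError("overlap_percentage must be between 1 and 99")
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,
                "overlapPercentage": overlap_percentage,
            },
        }
    }


config = fixed_size_chunking(max_tokens=300, overlap_percentage=20)
# Rough number of tokens neighbouring chunks share: 20% of 300.
overlap_tokens = 300 * 20 // 100
print(json.dumps(config, indent=2))
print(overlap_tokens)
```

At these settings, neighbouring chunks share about 60 tokens, which is what keeps a sentence that straddles a chunk boundary retrievable from either side.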

The proof is a retrieval query:

aws bedrock-agent-runtime retrieve \
  --knowledge-base-id <KB_ID> \
  --retrieval-query '{"text":"Is the Curl Cream safe for color-treated hair?"}' \
  --region us-east-2

The top hit came back with a relevance score of 0.86, on the exact product doc with the right answer. The library is stocked.
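If you save the JSON that the retrieve call prints, picking out the useful hits takes a few lines of Python. The response below is made up for illustration: only the 0.86 score comes from my run, and the text, URIs, and second result are placeholders. The field names follow the Retrieve response shape as I understand it, and the 0.5 cutoff is an arbitrary assumption, not a recommended threshold.

```python
# Made-up response for illustration; only the 0.86 score mirrors the real run.
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "placeholder chunk about color-treated hair"},
            "location": {"type": "S3", "s3Location": {"uri": "s3://<SOURCE_BUCKET>/example-doc.md"}},
            "score": 0.86,
        },
        {
            "content": {"text": "placeholder chunk about something unrelated"},
            "location": {"type": "S3", "s3Location": {"uri": "s3://<SOURCE_BUCKET>/other-doc.md"}},
            "score": 0.41,
        },
    ]
}


def top_hits(response: dict, min_score: float = 0.5) -> list:
    """Return (score, source URI, text) tuples at or above min_score, best first."""
    hits = []
    for result in response.get("retrievalResults", []):
        score = result.get("score", 0.0)
        if score < min_score:
            continue
        uri = result.get("location", {}).get("s3Location", {}).get("uri", "")
        hits.append((score, uri, result["content"]["text"]))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


for score, uri, text in top_hits(sample_response):
    print(f"{score:.2f}  {uri}  {text}")
```

Keeping the source URI next to each chunk is what lets an agent cite the document it answered from later on.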

What it bought me, and what I'd do differently

Strip away the demo and the toolkit changed two things: it handed the agent the validated setup order up front (no trial-and-error), and it caught two mistakes a model trained months ago wouldn't know, because it checks current docs and ships procedures AWS maintains. AWS reports developers see fewer iterations and errors with it; on this build, the two catches alone saved me an afternoon.

Two honest gaps. First, the toolkit's own rules recommend infrastructure-as-code over direct CLI, and I didn't follow that. I ran CLI calls and tracked them in a tagged manifest for teardown. It works, but CDK or CloudFormation would be the reproducible artifact a reader could clone. Second, I left the IAM role's trust policy scoped to knowledge-base/* instead of the specific KB id; narrowing the aws:SourceArn condition to that one Knowledge Base is the obvious hardening step before this is anything but a demo.
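Here's a sketch of what that trust policy looks like in both forms, based on the confused-deputy pattern the AWS docs describe for Bedrock service roles. The function name is hypothetical, and the account id and KB id are placeholder values, not mine.

```python
import json


def kb_trust_policy(account_id: str, region: str, kb_id: str = "*") -> dict:
    """Trust policy letting Bedrock assume the role only for this account's Knowledge Bases."""
    source_arn = f"arn:aws:bedrock:{region}:{account_id}:knowledge-base/{kb_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    # Confused-deputy guards: pin both the account and the calling resource.
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "ArnLike": {"aws:SourceArn": source_arn},
                },
            }
        ],
    }


# Placeholder account id and KB id for illustration.
demo_policy = kb_trust_policy("111122223333", "us-east-2")
hardened_policy = kb_trust_policy("111122223333", "us-east-2", kb_id="EXAMPLEKBID")
print(json.dumps(hardened_policy, indent=2))
```

The wildcard form exists for a practical reason: the KB id isn't known until the Knowledge Base is created, and the role has to exist first. Tightening it afterwards means updating the role's trust policy once the id is in hand.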

What's next

This is the retrieval foundation, not the whole app. Two concrete next steps, and you could take either:

Close the loop. Wire a LangGraph agent to call this Knowledge Base as a tool, so it retrieves and generates grounded answers. That's when "RAG knowledge base" graduates to "RAG application."
Make it reproducible. Convert the ad-hoc CLI provisioning into CDK or CloudFormation, so the whole stack stands up and tears down with one command, the way the toolkit's own rules recommend.

If you take one thing away: the toolkit's real value isn't typing commands for you. It's making better decisions, grounded in current docs, on the things an AI agent gets wrong in ways you don't notice until an hour later.