Fixed-size chunking is just the starting point. Semantic chunking, hierarchical retrieval, hybrid search, reranking, and metadata filtering turn a basic RAG pipeline into a production system - all of it configurable in Terraform.
In RAG Post 1, we deployed a basic Bedrock Knowledge Base with fixed-size chunking. It works, but retrieval quality is mediocre. Your users ask complex questions and get incomplete answers. The model pulls in irrelevant chunks while missing the ones that matter.
The fix isn't a better model. It's better retrieval. Bedrock Knowledge Bases supports four chunking strategies, hybrid search, reranking models, metadata filtering, and query decomposition. All of these are configurable through Terraform and the retrieval API. This post covers the production patterns that separate a demo from a system your users actually trust.
Chunking Strategies: Choosing the Right One
Chunking is the single biggest lever for RAG quality. How you split documents determines what gets retrieved. Bedrock supports four strategies:
| Strategy | How It Works | Best For | Terraform Value |
|---|---|---|---|
| FIXED_SIZE | Split every N tokens with overlap | General purpose, predictable costs | FIXED_SIZE |
| HIERARCHICAL | Parent/child chunks - search on children, return parents | Long docs with nested structure (manuals, legal) | HIERARCHICAL |
| SEMANTIC | Split by meaning using embedding similarity | Dense prose, technical docs | SEMANTIC |
| NONE | Each file = one chunk | Pre-processed documents | NONE |
Fixed-Size (Baseline)
Good starting point. Predictable chunk sizes make cost estimation easy:
```hcl
vector_ingestion_configuration {
  chunking_configuration {
    chunking_strategy = "FIXED_SIZE"

    fixed_size_chunking_configuration {
      max_tokens         = 512
      overlap_percentage = 20
    }
  }
}
```
Tuning guide: Start with 512 tokens and 20% overlap. If answers lack context, increase to 1024. If retrieval returns too much noise, decrease to 256.
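Because fixed-size chunking is deterministic, you can sanity-check ingestion volume before syncing. A minimal sketch (the token total is itself an estimate - Bedrock does the real tokenization during ingestion):

```python
import math

def estimate_chunks(total_tokens: int, max_tokens: int = 512,
                    overlap_percentage: int = 20) -> int:
    """Rough FIXED_SIZE chunk count: each chunk after the first re-reads
    the overlap, so the effective stride is max_tokens - overlap."""
    overlap = max_tokens * overlap_percentage // 100
    stride = max_tokens - overlap
    if total_tokens <= max_tokens:
        return 1
    return 1 + math.ceil((total_tokens - max_tokens) / stride)

# e.g. a ~50k-token manual at 512 tokens / 20% overlap
print(estimate_chunks(50_000))
```

Multiply by your embedding model's per-token price to get an ingestion cost estimate per document.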
Hierarchical (Best for Complex Documents)
This is where retrieval quality jumps. Bedrock searches on small child chunks for precision, then returns the larger parent chunk for context:
```hcl
vector_ingestion_configuration {
  chunking_configuration {
    chunking_strategy = "HIERARCHICAL"

    hierarchical_chunking_configuration {
      # Parent chunks: broad context returned to the model
      level_configuration {
        max_tokens = 1500
      }
      # Child chunks: precise units that are actually searched
      level_configuration {
        max_tokens = 300
      }
      overlap_tokens = 60
    }
  }
}
```
How it works: A 10-page legal document gets split into parent chunks (~1500 tokens, roughly a page) and child chunks (~300 tokens, roughly a paragraph). When a user asks a question, Bedrock matches against the precise child chunks, then replaces them with the broader parent chunks before passing context to the model. This means the model sees a full page of context rather than an isolated paragraph.
Trade-off: You may get fewer results than numberOfResults requests, because multiple child chunks can map to the same parent.
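If your application needs a guaranteed result count, one workaround is to over-request and truncate. A sketch of a hypothetical helper (`retrieve_at_least` is not an SDK call, just a thin wrapper around `retrieve`):

```python
def retrieve_at_least(client, kb_id: str, query: str,
                      want: int = 5, headroom: int = 3) -> list:
    """Over-request to compensate for child chunks collapsing into the
    same parent, then truncate back down to the desired count."""
    resp = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": want + headroom}
        },
    )
    return resp["retrievalResults"][:want]
```

Pick `headroom` based on how clustered your documents are; heavily nested manuals collapse more children per parent than flat FAQ collections.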
Semantic (Best for Dense Prose)
Semantic chunking uses an embedding model to split text at natural semantic boundaries:
```hcl
vector_ingestion_configuration {
  chunking_configuration {
    chunking_strategy = "SEMANTIC"

    semantic_chunking_configuration {
      max_tokens                      = 512
      buffer_size                     = 1
      breakpoint_percentile_threshold = 95
    }
  }
}
```
When to use: Documents where meaning doesn't align with fixed boundaries - research papers, narrative reports, policy documents. The breakpoint_percentile_threshold (0-99) controls sensitivity: higher values create fewer, larger chunks.
Cost note: Semantic chunking calls a foundation model during ingestion, adding cost per document. Factor this into your ingestion pipeline budget.
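The intuition behind `breakpoint_percentile_threshold` is easy to see in code: embed consecutive sentences and split wherever similarity drops unusually sharply. A conceptual sketch of percentile-based splitting - not Bedrock's exact implementation:

```python
import numpy as np

def semantic_breakpoints(embeddings: np.ndarray,
                         threshold_percentile: int = 95) -> list[int]:
    """Return sentence indices where a new chunk should start: positions
    where the cosine distance between consecutive sentence embeddings
    exceeds the given percentile of all consecutive distances."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # cosine distance between each sentence and the next
    distances = 1 - np.sum(unit[:-1] * unit[1:], axis=1)
    cutoff = np.percentile(distances, threshold_percentile)
    # higher percentile -> higher cutoff -> fewer, larger chunks
    return [i + 1 for i, d in enumerate(distances) if d > cutoff]
```

This also makes the cost note concrete: every sentence gets embedded during ingestion, which is where the extra foundation-model spend comes from.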
Hybrid Search
By default, Bedrock uses semantic (vector) search only. Hybrid search combines vector similarity with keyword (BM25) matching. This catches cases where exact terminology matters - product codes, legal references, technical terms:
```python
response = client.retrieve(
    knowledgeBaseId="YOUR_KB_ID",
    retrievalQuery={"text": "What is policy ABC-123?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "overrideSearchType": "HYBRID"
        }
    },
)
```
When to enable: Any knowledge base where users search for specific identifiers, codes, or exact phrases alongside natural language questions. Hybrid search is supported on OpenSearch Serverless and Amazon RDS vector stores.
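Before enabling hybrid everywhere, it's worth A/B-testing it against pure semantic search on your own identifier-heavy queries. A small hypothetical helper (`compare_search_types` is not an SDK call):

```python
def compare_search_types(client, kb_id: str, query: str, k: int = 5) -> dict:
    """Run the same query with SEMANTIC and HYBRID search and return
    the top-k result texts for each, for side-by-side inspection."""
    out = {}
    for search_type in ("SEMANTIC", "HYBRID"):
        resp = client.retrieve(
            knowledgeBaseId=kb_id,
            retrievalQuery={"text": query},
            retrievalConfiguration={
                "vectorSearchConfiguration": {
                    "numberOfResults": k,
                    "overrideSearchType": search_type,
                }
            },
        )
        out[search_type] = [r["content"]["text"] for r in resp["retrievalResults"]]
    return out
```

Run it over a handful of real user queries containing codes like "ABC-123"; if the HYBRID column surfaces documents the SEMANTIC column misses, the switch pays for itself.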
Reranking
Retrieval returns the top-K chunks by vector similarity. But similarity isn't the same as relevance. A reranking model re-scores those chunks using a deeper understanding of the query-document relationship:
```python
response = client.retrieve_and_generate(
    input={"text": "What are the penalties for late payment?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-20250514-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 15,
                    "overrideSearchType": "HYBRID",
                    # Reranking is configured inside vectorSearchConfiguration
                    "rerankingConfiguration": {
                        "type": "BEDROCK_RERANKING_MODEL",
                        "bedrockRerankingConfiguration": {
                            "modelConfiguration": {
                                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/cohere.rerank-v3-5:0"
                            },
                            "numberOfRerankedResults": 5
                        }
                    }
                }
            }
        }
    }
)
```
Pattern: Retrieve 15 chunks with hybrid search, rerank down to 5 with Cohere Rerank. This "retrieve wide, rerank narrow" pattern consistently outperforms retrieving 5 directly.
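The `retrieve_and_generate` response carries both the answer and per-passage citations, which you should surface to users. A sketch of pulling them out (field names follow the Bedrock Agent Runtime response shape; adjust if your SDK version differs):

```python
def answer_with_sources(response: dict) -> tuple[str, list[str]]:
    """Extract the generated answer plus a deduplicated list of the
    S3 URIs of the chunks it was grounded on."""
    answer = response["output"]["text"]
    sources: list[str] = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in sources:
                sources.append(uri)
    return answer, sources
```

Showing the source URIs alongside the answer is the cheapest trust-building feature you can ship.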
Metadata Filtering
Not every chunk is equal. Metadata filtering lets you scope retrieval to specific document categories, dates, or sources:
```python
response = client.retrieve(
    knowledgeBaseId="YOUR_KB_ID",
    retrievalQuery={"text": "What changed in the refund policy?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "filter": {
                "andAll": [
                    {"equals": {"key": "department", "value": "legal"}},
                    {"greaterThan": {"key": "year", "value": 2024}}
                ]
            }
        }
    },
)
```
Metadata is defined per-document using a companion .metadata.json file in S3:
```json
{
  "metadataAttributes": {
    "department": { "value": "legal", "type": "STRING" },
    "year": { "value": 2025, "type": "NUMBER" },
    "confidential": { "value": false, "type": "BOOLEAN" }
  }
}
```
Place this file alongside your document: policy.pdf gets policy.pdf.metadata.json.
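Writing these sidecar files by hand doesn't scale past a few documents, so it's worth scripting. A minimal sketch (`write_metadata` is a hypothetical helper; upload the sidecar to the same S3 prefix as the document):

```python
import json
from pathlib import Path

def write_metadata(doc_path: str, department: str, year: int,
                   confidential: bool = False) -> Path:
    """Write the companion .metadata.json file Bedrock expects next to
    a source document, using the typed attribute format."""
    sidecar = Path(f"{doc_path}.metadata.json")
    sidecar.write_text(json.dumps({
        "metadataAttributes": {
            "department": {"value": department, "type": "STRING"},
            "year": {"value": year, "type": "NUMBER"},
            "confidential": {"value": confidential, "type": "BOOLEAN"},
        }
    }, indent=2))
    return sidecar
```

Loop it over a directory of PDFs before your `aws s3 sync`, and every document lands in the knowledge base already filterable.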
Query Decomposition
Complex questions often contain multiple intents. Query decomposition breaks them into sub-queries that are each answered independently:
```python
response = client.retrieve_and_generate(
    input={"text": "Compare the 2024 and 2025 refund policies and highlight what changed"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-20250514-v1:0",
            "orchestrationConfiguration": {
                "queryTransformationConfiguration": {
                    "type": "QUERY_DECOMPOSITION"
                }
            }
        }
    }
)
```
Bedrock may decompose this into: "What is the 2024 refund policy?" and "What is the 2025 refund policy?" - retrieving separately then synthesizing. This increases API calls but significantly improves answers for comparative or multi-part questions.
Putting It All Together: Production Config
Here's a production-ready Terraform data source configuration combining hierarchical chunking with the retrieval optimizations above:
```hcl
# rag/data_source_prod.tf
resource "aws_bedrockagent_data_source" "s3_prod" {
  name              = "${var.environment}-${var.kb_name}-s3-source"
  knowledge_base_id = aws_bedrockagent_knowledge_base.this.id

  data_source_configuration {
    type = "S3"
    s3_configuration {
      bucket_arn         = aws_s3_bucket.knowledge_base_docs.arn
      inclusion_prefixes = var.s3_inclusion_prefixes
    }
  }

  vector_ingestion_configuration {
    chunking_configuration {
      chunking_strategy = var.chunking_strategy

      dynamic "hierarchical_chunking_configuration" {
        for_each = var.chunking_strategy == "HIERARCHICAL" ? [1] : []
        content {
          level_configuration {
            max_tokens = var.parent_chunk_tokens
          }
          level_configuration {
            max_tokens = var.child_chunk_tokens
          }
          overlap_tokens = var.overlap_tokens
        }
      }

      dynamic "semantic_chunking_configuration" {
        for_each = var.chunking_strategy == "SEMANTIC" ? [1] : []
        content {
          max_tokens                      = var.semantic_max_tokens
          buffer_size                     = var.semantic_buffer_size
          breakpoint_percentile_threshold = var.semantic_breakpoint_threshold
        }
      }

      dynamic "fixed_size_chunking_configuration" {
        for_each = var.chunking_strategy == "FIXED_SIZE" ? [1] : []
        content {
          max_tokens         = var.fixed_max_tokens
          overlap_percentage = var.fixed_overlap_percentage
        }
      }
    }
  }
}
```
With environment-specific variables:
```hcl
# environments/dev.tfvars
chunking_strategy        = "FIXED_SIZE"
fixed_max_tokens         = 300
fixed_overlap_percentage = 10
```

```hcl
# environments/prod.tfvars
chunking_strategy   = "HIERARCHICAL"
parent_chunk_tokens = 1500
child_chunk_tokens  = 300
overlap_tokens      = 60
```
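One operational note: chunking happens at ingestion time, so changing the strategy means re-syncing the data source (and since chunking configuration is fixed at creation, Terraform will typically replace the data source rather than update it in place). A minimal sketch of kicking off the re-sync, with a boto3 `bedrock-agent` client passed in:

```python
def resync(client, kb_id: str, data_source_id: str) -> str:
    """Start a fresh ingestion job so existing documents get re-chunked
    under the updated strategy; returns the job id to poll on.
    `client` is expected to be boto3.client('bedrock-agent')."""
    job = client.start_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
        description="re-ingest after chunking change",
    )
    return job["ingestionJob"]["ingestionJobId"]
```

Budget for this: a full re-ingestion re-embeds every document, which matters if you're switching a large corpus to SEMANTIC chunking.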
Azure vs AWS vs GCP: Advanced RAG Comparison
| Feature | Azure AI Search | AWS Bedrock KB | GCP RAG Engine |
|---|---|---|---|
| Chunking | Fixed-size + Document Layout skill | Fixed, hierarchical, semantic, Lambda | Fixed-size only |
| Hybrid search | BM25 + vector via RRF (built-in) | Supported on OpenSearch | Alpha-weighted dense/sparse |
| Semantic reranking | Built-in transformer ranker (L2) | Cohere Rerank | Rank Service + LLM Ranker |
| Query decomposition | Agentic retrieval (native) | Native API parameter | Not built-in |
| Metadata filtering | Filterable index fields + OData | JSON metadata files in S3 | Filter string at query time |
| Strictness control | 1-5 scale on data source | Not built-in | Vector distance threshold |
| Reranker score range | 0-4 (calibrated, cross-query consistent) | Model-dependent | Model-dependent |
Decision Framework
| Your Documents | Recommended Chunking | Search Type | Reranking |
|---|---|---|---|
| Short FAQs, structured content | FIXED_SIZE (256 tokens) | Hybrid | Optional |
| Long manuals, legal docs | HIERARCHICAL (1500/300) | Hybrid | Yes |
| Research papers, dense prose | SEMANTIC (512 tokens) | Semantic | Yes |
| Pre-chunked content | NONE | Hybrid | Optional |
| Mixed document types | HIERARCHICAL (safest default) | Hybrid | Yes |
Start with HIERARCHICAL + hybrid search + reranking for production. It's the most robust default. Only switch to SEMANTIC if your documents are uniformly dense prose. Use FIXED_SIZE in dev to iterate quickly.
What's Next
This is Post 2 of the AWS RAG Pipeline with Terraform series.
- Post 1: Bedrock Knowledge Base - Basic Setup
- Post 2: Advanced RAG - Chunking, Search, Reranking (you are here)
Your RAG pipeline just went from demo to production. Hierarchical chunking for context, hybrid search for precision, reranking for relevance, metadata filtering for scope - all driven by Terraform variables per environment.
Found this helpful? Follow for the full RAG Pipeline with Terraform series!