Fixed-size chunking is just the starting point. Semantic chunking, hierarchical retrieval, hybrid search, reranking, and metadata filtering turn a basic RAG pipeline into a production system - all of it configurable in Terraform.
In RAG Post 1, we deployed a basic Bedrock Knowledge Base with fixed-size chunking. It works, but retrieval quality is mediocre. Your users ask complex questions and get incomplete answers. The model pulls in irrelevant chunks while missing the ones that matter.
The fix isn't a better model. It's better retrieval. Bedrock Knowledge Bases supports four chunking strategies, hybrid search, reranking models, metadata filtering, and query decomposition. All of these are configurable through Terraform and the retrieval API. This post covers the production patterns that separate a demo from a system your users actually trust.
Chunking Strategies: Choosing the Right One
Chunking is the single biggest lever for RAG quality. How you split documents determines what gets retrieved. Bedrock supports four strategies:
| Strategy | How It Works | Best For | Terraform Value |
|---|---|---|---|
| FIXED_SIZE | Split every N tokens with overlap | General purpose, predictable costs | FIXED_SIZE |
| HIERARCHICAL | Parent/child chunks - search on children, return parents | Long docs with nested structure (manuals, legal) | HIERARCHICAL |
| SEMANTIC | Split by meaning using embedding similarity | Dense prose, technical docs | SEMANTIC |
| NONE | Each file = one chunk | Pre-processed documents | NONE |
Fixed-Size (Baseline)
Good starting point. Predictable chunk sizes make cost estimation easy:
```hcl
vector_ingestion_configuration {
  chunking_configuration {
    chunking_strategy = "FIXED_SIZE"

    fixed_size_chunking_configuration {
      max_tokens         = 512
      overlap_percentage = 20
    }
  }
}
```
Tuning guide: Start with 512 tokens and 20% overlap. If answers lack context, increase to 1024. If retrieval returns too much noise, decrease to 256.
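Because fixed-size chunking is deterministic, you can sanity-check ingestion volume before syncing. A minimal sketch (the token total is itself an estimate - Bedrock does the real tokenization during ingestion):

```python
import math

def estimate_chunks(total_tokens: int, max_tokens: int = 512,
                    overlap_percentage: int = 20) -> int:
    """Rough FIXED_SIZE chunk count: each chunk after the first re-reads
    the overlap, so the effective stride is max_tokens - overlap."""
    overlap = max_tokens * overlap_percentage // 100
    stride = max_tokens - overlap
    if total_tokens <= max_tokens:
        return 1
    return 1 + math.ceil((total_tokens - max_tokens) / stride)

# e.g. a ~50k-token manual at 512 tokens / 20% overlap
print(estimate_chunks(50_000))
```

Multiply by your embedding model's per-token price to get an ingestion cost estimate per document.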
Hierarchical (Best for Complex Documents)
This is where retrieval quality jumps. Bedrock searches on small child chunks for precision, then returns the larger parent chunk for context:
```hcl
vector_ingestion_configuration {
  chunking_configuration {
    chunking_strategy = "HIERARCHICAL"

    hierarchical_chunking_configuration {
      # Parent chunks: broad context returned to the model
      level_configuration {
        max_tokens = 1500
      }
      # Child chunks: precise units that are actually searched
      level_configuration {
        max_tokens = 300
      }
      overlap_tokens = 60
    }
  }
}
```
How it works: A 10-page legal document gets split into parent chunks (~1500 tokens, roughly a page) and child chunks (~300 tokens, roughly a paragraph). When a user asks a question, Bedrock matches against the precise child chunks, then replaces them with the broader parent chunks before passing context to the model. This means the model sees a full page of context rather than an isolated paragraph.
Trade-off: You may get fewer results than numberOfResults requests, because multiple child chunks can map to the same parent.
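If your application needs a guaranteed result count, one workaround is to over-request and truncate. A sketch of a hypothetical helper (`retrieve_at_least` is not an SDK call, just a thin wrapper around `retrieve`):

```python
def retrieve_at_least(client, kb_id: str, query: str,
                      want: int = 5, headroom: int = 3) -> list:
    """Over-request to compensate for child chunks collapsing into the
    same parent, then truncate back down to the desired count."""
    resp = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": want + headroom}
        },
    )
    return resp["retrievalResults"][:want]
```

Pick `headroom` based on how clustered your documents are; heavily nested manuals collapse more children per parent than flat FAQ collections.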
Semantic (Best for Dense Prose)
Semantic chunking uses an embedding model to split text at natural semantic boundaries:
```hcl
vector_ingestion_configuration {
  chunking_configuration {
    chunking_strategy = "SEMANTIC"

    semantic_chunking_configuration {
      max_tokens                      = 512
      buffer_size                     = 1
      breakpoint_percentile_threshold = 95
    }
  }
}
```
When to use: Documents where meaning doesn't align with fixed boundaries - research papers, narrative reports, policy documents. The breakpoint_percentile_threshold (0-99) controls sensitivity: higher values create fewer, larger chunks.
Cost note: Semantic chunking calls a foundation model during ingestion, adding cost per document. Factor this into your ingestion pipeline budget.
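The intuition behind `breakpoint_percentile_threshold` is easy to see in code: embed consecutive sentences and split wherever similarity drops unusually sharply. A conceptual sketch of percentile-based splitting - not Bedrock's exact implementation:

```python
import numpy as np

def semantic_breakpoints(embeddings: np.ndarray,
                         threshold_percentile: int = 95) -> list[int]:
    """Return sentence indices where a new chunk should start: positions
    where the cosine distance between consecutive sentence embeddings
    exceeds the given percentile of all consecutive distances."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # cosine distance between each sentence and the next
    distances = 1 - np.sum(unit[:-1] * unit[1:], axis=1)
    cutoff = np.percentile(distances, threshold_percentile)
    # higher percentile -> higher cutoff -> fewer, larger chunks
    return [i + 1 for i, d in enumerate(distances) if d > cutoff]
```

This also makes the cost note concrete: every sentence gets embedded during ingestion, which is where the extra foundation-model spend comes from.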
Hybrid Search
By default, Bedrock uses semantic (vector) search only. Hybrid search combines vector similarity with keyword (BM25) matching. This catches cases where exact terminology matters - product codes, legal references, technical terms:
```python
response = client.retrieve(
    knowledgeBaseId="YOUR_KB_ID",
    retrievalQuery={"text": "What is policy ABC-123?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "overrideSearchType": "HYBRID"
        }
    },
)
```
When to enable: Any knowledge base where users search for specific identifiers, codes, or exact phrases alongside natural language questions. Hybrid search is supported on OpenSearch Serverless and Amazon RDS vector stores.
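Before enabling hybrid everywhere, it's worth A/B-testing it against pure semantic search on your own identifier-heavy queries. A small hypothetical helper (`compare_search_types` is not an SDK call):

```python
def compare_search_types(client, kb_id: str, query: str, k: int = 5) -> dict:
    """Run the same query with SEMANTIC and HYBRID search and return
    the top-k result texts for each, for side-by-side inspection."""
    out = {}
    for search_type in ("SEMANTIC", "HYBRID"):
        resp = client.retrieve(
            knowledgeBaseId=kb_id,
            retrievalQuery={"text": query},
            retrievalConfiguration={
                "vectorSearchConfiguration": {
                    "numberOfResults": k,
                    "overrideSearchType": search_type,
                }
            },
        )
        out[search_type] = [r["content"]["text"] for r in resp["retrievalResults"]]
    return out
```

Run it over a handful of real user queries containing codes like "ABC-123"; if the HYBRID column surfaces documents the SEMANTIC column misses, the switch pays for itself.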
Reranking
Retrieval returns the top-K chunks by vector similarity. But similarity isn't the same as relevance. A reranking model re-scores those chunks using a deeper understanding of the query-document relationship:
```python
response = client.retrieve_and_generate(
    input={"text": "What are the penalties for late payment?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-20250514-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 15,
                    "overrideSearchType": "HYBRID",
                    # Reranking is configured inside vectorSearchConfiguration
                    "rerankingConfiguration": {
                        "type": "BEDROCK_RERANKING_MODEL",
                        "bedrockRerankingConfiguration": {
                            "modelConfiguration": {
                                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/cohere.rerank-v3-5:0"
                            },
                            "numberOfRerankedResults": 5
                        }
                    }
                }
            }
        }
    }
)
```
Pattern: Retrieve 15 chunks with hybrid search, rerank down to 5 with Cohere Rerank. This "retrieve wide, rerank narrow" pattern consistently outperforms retrieving 5 directly.
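The `retrieve_and_generate` response carries both the answer and per-passage citations, which you should surface to users. A sketch of pulling them out (field names follow the Bedrock Agent Runtime response shape; adjust if your SDK version differs):

```python
def answer_with_sources(response: dict) -> tuple[str, list[str]]:
    """Extract the generated answer plus a deduplicated list of the
    S3 URIs of the chunks it was grounded on."""
    answer = response["output"]["text"]
    sources: list[str] = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in sources:
                sources.append(uri)
    return answer, sources
```

Showing the source URIs alongside the answer is the cheapest trust-building feature you can ship.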
Metadata Filtering
Not every chunk is equal. Metadata filtering lets you scope retrieval to specific document categories, dates, or sources:
```python
response = client.retrieve(
    knowledgeBaseId="YOUR_KB_ID",
    retrievalQuery={"text": "What changed in the refund policy?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "filter": {
                "andAll": [
                    {"equals": {"key": "department", "value": "legal"}},
                    {"greaterThan": {"key": "year", "value": 2024}}
                ]
            }
        }
    },
)
```
Metadata is defined per-document using a companion .metadata.json file in S3:
```json
{
  "metadataAttributes": {
    "department": { "value": "legal", "type": "STRING" },
    "year": { "value": 2025, "type": "NUMBER" },
    "confidential": { "value": false, "type": "BOOLEAN" }
  }
}
```
Place this file alongside your document: policy.pdf gets policy.pdf.metadata.json.
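Writing these sidecar files by hand doesn't scale past a few documents, so it's worth scripting. A minimal sketch (`write_metadata` is a hypothetical helper; upload the sidecar to the same S3 prefix as the document):

```python
import json
from pathlib import Path

def write_metadata(doc_path: str, department: str, year: int,
                   confidential: bool = False) -> Path:
    """Write the companion .metadata.json file Bedrock expects next to
    a source document, using the typed attribute format."""
    sidecar = Path(f"{doc_path}.metadata.json")
    sidecar.write_text(json.dumps({
        "metadataAttributes": {
            "department": {"value": department, "type": "STRING"},
            "year": {"value": year, "type": "NUMBER"},
            "confidential": {"value": confidential, "type": "BOOLEAN"},
        }
    }, indent=2))
    return sidecar
```

Loop it over a directory of PDFs before your `aws s3 sync`, and every document lands in the knowledge base already filterable.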
Query Decomposition
Complex questions often contain multiple intents. Query decomposition breaks them into sub-queries that are each answered independently:
```python
response = client.retrieve_and_generate(
    input={"text": "Compare the 2024 and 2025 refund policies and highlight what changed"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-20250514-v1:0",
            "orchestrationConfiguration": {
                "queryTransformationConfiguration": {
                    "type": "QUERY_DECOMPOSITION"
                }
            }
        }
    }
)
```
Bedrock may decompose this into: "What is the 2024 refund policy?" and "What is the 2025 refund policy?" - retrieving separately then synthesizing. This increases API calls but significantly improves answers for comparative or multi-part questions.
Putting It All Together: Production Config
Here's a production-ready Terraform data source configuration combining hierarchical chunking with the retrieval optimizations above:
```hcl
# rag/data_source_prod.tf
resource "aws_bedrockagent_data_source" "s3_prod" {
  name              = "${var.environment}-${var.kb_name}-s3-source"
  knowledge_base_id = aws_bedrockagent_knowledge_base.this.id

  data_source_configuration {
    type = "S3"
    s3_configuration {
      bucket_arn         = aws_s3_bucket.knowledge_base_docs.arn
      inclusion_prefixes = var.s3_inclusion_prefixes
    }
  }

  vector_ingestion_configuration {
    chunking_configuration {
      chunking_strategy = var.chunking_strategy

      dynamic "hierarchical_chunking_configuration" {
        for_each = var.chunking_strategy == "HIERARCHICAL" ? [1] : []
        content {
          level_configuration {
            max_tokens = var.parent_chunk_tokens
          }
          level_configuration {
            max_tokens = var.child_chunk_tokens
          }
          overlap_tokens = var.overlap_tokens
        }
      }

      dynamic "semantic_chunking_configuration" {
        for_each = var.chunking_strategy == "SEMANTIC" ? [1] : []
        content {
          max_tokens                      = var.semantic_max_tokens
          buffer_size                     = var.semantic_buffer_size
          breakpoint_percentile_threshold = var.semantic_breakpoint_threshold
        }
      }

      dynamic "fixed_size_chunking_configuration" {
        for_each = var.chunking_strategy == "FIXED_SIZE" ? [1] : []
        content {
          max_tokens         = var.fixed_max_tokens
          overlap_percentage = var.fixed_overlap_percentage
        }
      }
    }
  }
}
```
With environment-specific variables:
```hcl
# environments/dev.tfvars
chunking_strategy        = "FIXED_SIZE"
fixed_max_tokens         = 300
fixed_overlap_percentage = 10
```

```hcl
# environments/prod.tfvars
chunking_strategy   = "HIERARCHICAL"
parent_chunk_tokens = 1500
child_chunk_tokens  = 300
overlap_tokens      = 60
```
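One operational note: chunking happens at ingestion time, so changing the strategy means re-syncing the data source (and since chunking configuration is fixed at creation, Terraform will typically replace the data source rather than update it in place). A minimal sketch of kicking off the re-sync, with a boto3 `bedrock-agent` client passed in:

```python
def resync(client, kb_id: str, data_source_id: str) -> str:
    """Start a fresh ingestion job so existing documents get re-chunked
    under the updated strategy; returns the job id to poll on.
    `client` is expected to be boto3.client('bedrock-agent')."""
    job = client.start_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
        description="re-ingest after chunking change",
    )
    return job["ingestionJob"]["ingestionJobId"]
```

Budget for this: a full re-ingestion re-embeds every document, which matters if you're switching a large corpus to SEMANTIC chunking.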
Azure vs AWS vs GCP: Advanced RAG Comparison
| Feature | Azure AI Search | AWS Bedrock KB | GCP RAG Engine |
|---|---|---|---|
| Chunking | Fixed-size + Document Layout skill | Fixed, hierarchical, semantic, Lambda | Fixed-size only |
| Hybrid search | BM25 + vector via RRF (built-in) | Supported on OpenSearch | Alpha-weighted dense/sparse |
| Semantic reranking | Built-in transformer ranker (L2) | Cohere Rerank | Rank Service + LLM Ranker |
| Query decomposition | Agentic retrieval (native) | Native API parameter | Not built-in |
| Metadata filtering | Filterable index fields + OData | JSON metadata files in S3 | Filter string at query time |
| Strictness control | 1-5 scale on data source | Not built-in | Vector distance threshold |
| Reranker score range | 0-4 (calibrated, cross-query consistent) | Model-dependent | Model-dependent |
Decision Framework
| Your Documents | Recommended Chunking | Search Type | Reranking |
|---|---|---|---|
| Short FAQs, structured content | FIXED_SIZE (256 tokens) | Hybrid | Optional |
| Long manuals, legal docs | HIERARCHICAL (1500/300) | Hybrid | Yes |
| Research papers, dense prose | SEMANTIC (512 tokens) | Semantic | Yes |
| Pre-chunked content | NONE | Hybrid | Optional |
| Mixed document types | HIERARCHICAL (safest default) | Hybrid | Yes |
Start with HIERARCHICAL + hybrid search + reranking for production. It's the most robust default. Only switch to SEMANTIC if your documents are uniformly dense prose. Use FIXED_SIZE in dev to iterate quickly.
What's Next
This is Post 2 of the AWS RAG Pipeline with Terraform series.
- Post 1: Bedrock Knowledge Base - Basic Setup
- Post 2: Advanced RAG - Chunking, Search, Reranking (you are here)
Your RAG pipeline just went from demo to production. Hierarchical chunking for context, hybrid search for precision, reranking for relevance, metadata filtering for scope - all driven by Terraform variables per environment.
Found this helpful? Follow for the full RAG Pipeline with Terraform series!