
Gerardo Arroyo for AWS Community Builders

Posted on • Originally published at gerardo.dev

Real Benchmark: 5 Chunking Strategies in Amazon Bedrock Knowledge Bases


A few weeks ago I ran into a question I've been hearing more and more in conversations with architects and dev teams:

"I'm going to implement a RAG with Bedrock Knowledge Bases. Which chunking strategy should I use? I see there are five and they all sound reasonable."

It's a fair question, and honestly I didn't have an answer that left me satisfied. The AWS docs describe each strategy clearly. Tech blogs discuss them in conceptual terms. The comparisons I'd seen usually stopped at "each one has its use case." But there was very little concrete data on how they behave against a real corpus.

So I decided to run the benchmark myself. With a reproducible methodology, real data, and objective metrics. What I found surprised me enough to make it worth writing this article, because reality is quite different from what the documentation suggests.

🎯 Spoiler: Of the 5 strategies, only 3 could process a real technical documentation corpus. The other 2 failed at the ingestion stage, not because of poor chunk quality, but due to hard service limits that aren't mentioned at the moment you pick the strategy.

In this article I'm sharing the full methodology, the quantitative results (25 questions evaluated with LLM-as-a-judge), and something I find even more valuable: the 7 infrastructure problems I had to solve to get everything running with Terraform. Because the "official" sample code assumes things that aren't always true.

πŸ“Œ TL;DR β€” Key data before you read on

  • Titan V2 embeddings: 50,000-character / 8,192-token limit per request → makes NONE unviable for a normal corpus.
  • SEMANTIC chunking: empirical limit of 1 MB per file → fails on most technical documentation.
  • S3 Vectors: 2,048-byte filterable metadata limit → fixed by declaring nonFilterableMetadataKeys when creating the index.
  • Sonnet 4.6/4.5/Opus 4.x are not on the Bedrock Evaluations judge allowlist → use Nova Pro as a cross-family judge.
  • Winners on a real corpus: Custom (0.94), Hierarchical (0.92), Fixed (0.88) on Correctness. NONE and SEMANTIC failed at ingestion before they could be evaluated fairly.
  • Production recommendation: start with FIXED_SIZE (max_tokens=512, overlap=20%) + S3 Vectors + periodic evaluation. Change only if the data justifies the complexity.

The Context: Why This Matters to Me

I've been building RAGs on top of Bedrock Knowledge Bases across several projects, and every time chunking has to be configured, the same conversation shows up. Someone on the team asks "hierarchical or semantic?", someone else says "let's try fixed, it sounds safest", and in the end the decision gets made on intuition, not evidence.

The problem with that approach is that when the RAG doesn't work well in production, we don't know whether it was the chunking, the embedding, the retrieval, or the generator. We're debugging in the dark.

My goal with this benchmark was twofold:

  1. Produce reproducible data that any team can use to justify an architecture decision.
  2. Isolate chunking as the single variable so the results are honest.

Additional spoiler: nailing that second part was harder than I expected.

The 5 Chunking Strategies (And an Important Clarification)

Before jumping into results, let's align on what these 5 strategies are. According to the official Amazon Bedrock documentation, the options available in ChunkingConfiguration are:

| Strategy | What it does |
| --- | --- |
| NONE | Doesn't chunk. Each file is treated as a single chunk. |
| FIXED_SIZE | Splits text into chunks of a configurable approximate size (tokens), with overlap. |
| HIERARCHICAL | Splits the document into two layers: large "parent" chunks and smaller "child" chunks derived from them. |
| SEMANTIC | Splits based on semantic similarity between sentences using an embedding model. |
| CUSTOM (Lambda) | Your own chunking logic executed as a Lambda transformation. |

πŸ” ProTip #1: In many places you'll see "multimodal chunking" mentioned as a sixth strategy. It's not. Multimodal chunking (audio, video, images) happens at the embedding model level (e.g., Nova multimodal embeddings) and its configuration is independent of ChunkingConfiguration. The 5 strategies above apply only to text documents, even if you have multimodal content in your data source. I see this confusion a lot with architects.

The Setup: Isolating Chunking as the Only Variable

The thesis of the benchmark is simple: if you're going to compare chunking strategies, everything else has to be identical across KBs. Any other variable contaminates the results.

So all 5 Knowledge Bases share:

  • The same corpus in S3 (3 files)
  • The same embedding model: amazon.titan-embed-text-v2:0, 1024 dimensions
  • The same vector store: Amazon S3 Vectors (more on this later)
  • The same generator model: us.anthropic.claude-sonnet-4-6 via inference profile
  • The same judge model: amazon.nova-pro-v1:0
  • The same set of 25 questions with ground truth

The only thing that changes between KBs: the ChunkingConfiguration.
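As a sketch, the per-KB payloads look roughly like this. Field names follow the Bedrock CreateDataSource API as I understand it, and the hierarchical/semantic parameter values are illustrative defaults, not the exact ones from this benchmark; verify the shapes against the current API reference before using.

```python
# Hedged sketch of the ChunkingConfiguration payload per strategy.
# Shapes follow the Bedrock CreateDataSource API; parameter values
# (level sizes, buffer, threshold) are illustrative assumptions.
CHUNKING_CONFIGS = {
    "none": {"chunkingStrategy": "NONE"},
    "fixed": {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {"maxTokens": 512, "overlapPercentage": 20},
    },
    "hierarchical": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
            "overlapTokens": 60,
        },
    },
    "semantic": {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": 300,
            "bufferSize": 0,
            "breakpointPercentileThreshold": 95,
        },
    },
    # CUSTOM is not a chunkingStrategy value: it pairs chunkingStrategy NONE
    # with a customTransformationConfiguration pointing at the Lambda.
}
```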

Why S3 Vectors as the backend?

When I started putting this infrastructure together, I originally pointed at OpenSearch Serverless, which is the default backend when you create a KB from the console. I did the cost math:

| Backend | Base cost to keep the infrastructure up |
| --- | --- |
| OpenSearch Serverless (vector collection) | ~$11.52 USD/day (floor of 2 OCUs × $0.24/hour, the mandatory minimum in production for vector collections) |
| S3 Vectors | $0 base: you only pay storage ($0.06/GB/month), PUT ($0.20/GB), and queries ($2.50/M API calls + $/TB processed) |

For a benchmark involving several iterations and potential debugging, that difference is decisive. Amazon S3 Vectors reached GA on December 2, 2025 and integrates natively with Bedrock Knowledge Bases. Storage costs $0.06/GB/month, PUT costs $0.20/GB of logical data uploaded, and queries are billed per API call ($2.50/M) plus $/TB processed. There's no base cost to keep the infrastructure up: unlike OpenSearch OCUs, no compute is running when you're not using the service.
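The OpenSearch floor is easy to verify with back-of-the-envelope arithmetic:

```python
# Back-of-the-envelope cost floor for an OpenSearch Serverless vector collection.
OCU_HOURLY_USD = 0.24   # price per OCU-hour cited above
MIN_OCUS = 2            # minimum OCUs for a production vector collection

daily_floor = MIN_OCUS * OCU_HOURLY_USD * 24
monthly_floor = daily_floor * 30

print(f"daily floor:   ${daily_floor:.2f}")    # $11.52
print(f"monthly floor: ${monthly_floor:.2f}")  # $345.60
```

That monthly floor accrues whether or not a single query runs, which is what makes it painful for an on-and-off benchmark.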

πŸ” ProTip #2: S3 Vectors has three trade-offs you should know before choosing it:

  1. Latency: 100-800ms vs 10-100ms on OpenSearch.
  2. Semantic search only: does not support hybrid search in Bedrock KB (confirmed in the official documentation).
  3. Limited metadata: max 1 KB of custom metadata and 35 keys per vector when used with Bedrock KB. If you use HIERARCHICAL chunking with high token counts, AWS explicitly warns you may exceed the metadata limits because parent-child relationships are stored as non-filterable metadata.

For an offline benchmark this doesn't matter. For production with exact keyword matching or rich metadata, you probably want OpenSearch. Use S3 Vectors when you prioritize cost over extreme latency.

The Corpus

I chose 3 documents with different structures on purpose, to stress different assumptions:

| File | Size | Approx. characters | Structure | Initial hypothesis |
| --- | --- | --- | --- | --- |
| well-architected-framework.pdf | 14 MB | ~2,530,000 | Strongly hierarchical (6 pillars → principles → practices) | Should favor HIERARCHICAL |
| bedrock-agentcore-dg.pdf | 17 MB | ~2,400,000 | Dense technical prose with subtle topic shifts | Should favor SEMANTIC |
| blog-rag-evaluation.html | 1 MB | ~1,080,000 | Long narrative blog-style | Should expose the limits of FIXED_SIZE |

As I'll show later, none of those initial hypotheses survived the first ingestion attempt. And that was precisely the most important finding.

Finding #1: NONE Isn't as Innocent as It Sounds

My first attempt to ingest the corpus with the NONE strategy threw this error:

```
Malformed input request: expected maxLength: 50000, actual: 2530200,
please reformat your input and try again.
(Service: BedrockRuntime, Status Code: 400)
Issue occurred while processing file: well-architected-framework.pdf
```

I'll admit it took me a second to understand what was going on.

The NONE strategy tells Bedrock not to chunk: the full document gets sent to the embedding model as a single request. And here's the crucial detail: according to the official Amazon Titan Text Embeddings V2 documentation, the model accepts "up to 8,192 tokens or 50,000 characters".

My Well-Architected PDF has 2.5 million characters. Fifty times the limit.

What does this mean in practice?

The NONE strategy is perfectly valid, but only if your corpus is already pre-chunked. That is, only if each file in your S3 bucket is a small logical unit (an FAQ, a product, a ticket, a glossary definition) that fits within those 50,000 characters.

The documentation itself acknowledges this, though subtly:

"If you choose this option [NONE], you may want to pre-process your documents by splitting them into separate files."

But the key word here is "may." In reality it's a "must."

🎯 ProTip #3: When you see the NONE option in the Bedrock console, mentally translate it to PRE_CHUNKED. It's not "no chunking": it's "chunking delegated to you, before uploading to S3." If your corpus is normal technical PDFs, NONE will fail. If it's a database of frequently asked questions with one question per file, it's perfect.
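A quick pre-flight check saves a failed ingestion. This is a hypothetical helper using the 50,000-character limit from the Titan V2 docs quoted above; the 8,192-token limit would need a real tokenizer, so I only check characters here:

```python
# Flag files the NONE strategy cannot embed in a single Titan V2 request.
# Assumes plain-text content; for PDFs, check the extracted text length.
TITAN_V2_CHAR_LIMIT = 50_000

def none_strategy_viable(docs: dict[str, str]) -> dict[str, bool]:
    """Map each file name to whether it fits in one embedding request."""
    return {name: len(text) <= TITAN_V2_CHAR_LIMIT for name, text in docs.items()}

corpus = {
    "faq-shipping.txt": "Q: How long does shipping take? A: 3-5 business days.",
    "well-architected-framework.txt": "x" * 2_530_000,  # ~2.53M chars
}
print(none_strategy_viable(corpus))
# {'faq-shipping.txt': True, 'well-architected-framework.txt': False}
```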

Result: with my corpus, NONE indexed 1 out of 3 documents (the 1 MB HTML also exceeded the limit in many places, but it processed something). Both PDFs failed completely.

Finding #2: SEMANTIC Has a 1 MB Per-File Limit That Isn't Documented Where You Pick It

I moved to the next strategy with some expectations. SEMANTIC chunking analyzes text with an auxiliary embedding model and detects "breakpoints" between sentences where the topic shifts. Sounds good for dense technical documentation with subtle topic changes, right?

The ingestion log told me otherwise:

```
File body text exceeds size limit of 1000000 for semantic chunking.
[Files: s3://.../bedrock-agentcore-dg.pdf,
        s3://.../well-architected-framework.pdf]
```

One million characters, roughly 1 MB. Per file.

Why is this a problem?

I went through the chunking documentation carefully. It describes the semantic chunking parameters (max tokens, buffer size, breakpoint percentile threshold). It talks about the additional costs of using a foundation model. But the 1 MB per-file limit is not mentioned on the screen where you pick the strategy. You discover it when ingestion fails.

And it's a practical, not theoretical, limit: an average AWS developer guide already exceeds that size. A normal whitepaper exceeds it. Practically any real technical documentation over ~200-300 pages exceeds it.

⚠️ ProTip #4: If you have large technical documentation and want to use SEMANTIC chunking, you'll have to do pre-splitting yourself before uploading to S3. Which has an interesting irony: you're manually chunking so you can use the "semantic" chunking strategy. For most real enterprise corpora (manuals, policies, whitepapers), SEMANTIC isn't viable without significant preprocessing.
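If you do need SEMANTIC on a large corpus, the pre-splitting can be as simple as cutting on paragraph boundaries with a hard cap. A minimal sketch: the 1,000,000-character cap matches the error above, while the paragraph heuristic is my own assumption, not anything Bedrock prescribes.

```python
# Split a large document into parts under SEMANTIC's per-file cap,
# cutting on paragraph boundaries ("\n\n") to avoid slicing mid-sentence.
SEMANTIC_CHAR_CAP = 1_000_000

def presplit(text: str, cap: int = SEMANTIC_CHAR_CAP) -> list[str]:
    parts, current, size = [], [], 0
    for para in text.split("\n\n"):
        block = para + "\n\n"
        if size + len(block) > cap and current:
            parts.append("".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        parts.append("".join(current))
    return parts

doc = ("lorem ipsum " * 50 + "\n\n") * 3000   # ~1.8M characters
parts = presplit(doc)
assert all(len(p) <= SEMANTIC_CHAR_CAP for p in parts)
```

A single paragraph larger than the cap would still exceed it, so for pathological inputs you'd add a hard cut inside the loop.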

Result: SEMANTIC also indexed 1 out of 3 documents (only the blog HTML, which was just under the limit).

The Qualitative Cut Before Measuring Quality

After the first two findings, I already had half the benchmark story before running a single evaluation. This is the table nobody shows you when comparing chunking strategies:

| Strategy | Documents indexed | Why |
| --- | --- | --- |
| NONE | 1 / 3 | Fails on files > 50,000 characters |
| FIXED_SIZE | 3 / 3 ✅ | No practical size restrictions |
| HIERARCHICAL | 3 / 3 ✅ | No practical size restrictions |
| SEMANTIC | 1 / 3 | Fails on files > 1,000,000 characters |
| CUSTOM | 3 / 3 ✅ | After solving the 3 gotchas we'll see below |

Before even evaluating retrieval quality, only 3 of the 5 strategies can ingest normal technical documentation without preprocessing. This is the takeaway you should leave with even if you read nothing else from the article.

The 7 Infrastructure Gotchas Nobody Documents Together

Before showing the quantitative numbers, I need to tell you about something that took me longer than expected: the infrastructure problems that came up when trying to deploy everything with Terraform. There are 7 in total, and they're the kind of thing you only discover when you sit down to do it from scratch, without the console helping you.

I'm leaving them here because anyone trying to reproduce this benchmark will hit several of them, and having them consolidated in one place saves a lot of time.

Gotcha #1: Why does ingestion fail with "Filterable metadata must have at most 2048 bytes"?

On the first ingestion attempt, all 5 KBs failed with the same error:

```
Invalid record for key '<uuid>':
Filterable metadata must have at most 2048 bytes
(Service: S3Vectors, Status Code: 400)
```

S3 Vectors has a 2,048-byte limit on "filterable" metadata per vector. By default, Bedrock KB puts two things in as filterable: AMAZON_BEDROCK_TEXT (the chunk text) and AMAZON_BEDROCK_METADATA (document metadata). Almost any reasonably sized chunk exceeds 2 KB with the text alone.

The fix: when creating the S3 Vectors index, explicitly declare those fields as non-filterable:

```hcl
resource "aws_s3vectors_index" "strategies" {
  # ... other fields ...
  metadata_configuration {
    non_filterable_metadata_keys = [
      "AMAZON_BEDROCK_TEXT",
      "AMAZON_BEDROCK_METADATA",
    ]
  }
}
```

🚨 ProTip #5: S3 Vectors indexes are immutable. If you create an index without this setting and realize later, there's no way to edit it: you have to terraform destroy and apply again. Verify this before provisioning.

Gotchas #2-4: Why does the CUSTOM chunker Lambda fail with "Access denied for lambda:InvokeFunction"?

Setting up a Lambda chunker sounds straightforward on paper: write the code, give it IAM permissions, done. In practice, I had to solve three distinct problems that manifest with very similar errors. If you fix only one or two of them, it keeps failing with what looks like the same message.

Problem 1: Missing aws_lambda_permission

First error:

```
Access denied for lambda:InvokeFunction for Lambda function ARN
arn:aws:lambda:us-east-1:...:function:...-chunker:$LATEST.
```

Giving the KB's IAM role a lambda:InvokeFunction permission isn't enough. Lambda also requires that the function have a resource-based policy allowing bedrock.amazonaws.com to invoke it:

```hcl
resource "aws_lambda_permission" "bedrock_invoke" {
  statement_id  = "AllowBedrockKBInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.custom_chunker.function_name
  principal     = "bedrock.amazonaws.com"
  source_arn    = "arn:aws:bedrock:${var.aws_region}:${data.aws_caller_identity.current.account_id}:knowledge-base/*"
}
```

When you create the KB through the console, AWS generates this permission automatically. With raw Terraform, you have to declare it explicitly.

Problem 2: The KB role's Resource needs to include the qualifier wildcard

With the resource-based permission in place, the next attempt failed with the same message. The subtle difference: now the problem is on the KB's IAM role side.

The reason: Bedrock invokes the Lambda using the qualified ARN <arn>:$LATEST, not the base ARN. If your policy says:

```hcl
Resource = aws_lambda_function.custom_chunker.arn
```

IAM doesn't match. The fix is to include both:

```hcl
Resource = [
  aws_lambda_function.custom_chunker.arn,
  "${aws_lambda_function.custom_chunker.arn}:*",
]
```

Problem 3: The handler contract uses relative keys, not S3 URIs

With the two IAM issues fixed, the Lambda finally got invoked. And it blew up with:

```
ValueError: Invalid S3 URI: intermediate/.../well-architected-framework_1.JSON
```

The examples floating around show event["inputFiles"][*]["contentBatches"][*]["key"] treated as if it were an s3://bucket/key URI. It isn't. Bedrock sends only the key path relative to the intermediate bucket, which you get in event["bucketName"]:

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    intermediate_bucket = event["bucketName"]
    processed_batches = []
    for input_file in event["inputFiles"]:
        for batch in input_file["contentBatches"]:
            key = batch["key"]  # relative path, NOT a URI
            response = s3.get_object(Bucket=intermediate_bucket, Key=key)
            # ... run chunking, producing output_key and the chunked body ...
            s3.put_object(Bucket=intermediate_bucket, Key=output_key, Body=...)
            # Output: key, NOT URI
            processed_batches.append({"key": output_key})
```

πŸ”§ ProTip #6: To have a working CUSTOM chunker deployed with Terraform you need all three fixes together. Solving just one or two produces errors similar enough that they send you off debugging the wrong thing. If yours doesn't work first try, check all three before assuming it's something else.

Gotcha #5: Why doesn't Sonnet 4.6 show up as a valid judge model in Bedrock Evaluations?

When I tried to use Sonnet 4.6 as a judge for the evaluations:

```
ValidationException: The requested evaluator model(s)
us.anthropic.claude-sonnet-4-6 are not supported.
```

Falling back to Sonnet 3.7:

```
ValidationException: Access denied. This Model is marked by provider as
Legacy and you have not been actively using the model in the last 30 days.
```

Bedrock Evaluations maintains a fixed allowlist of models allowed to act as judge. According to the official documentation verified as of April 2026, the list is:

  • amazon.nova-pro-v1:0
  • anthropic.claude-3-5-sonnet-20240620-v1:0
  • anthropic.claude-3-5-sonnet-20241022-v2:0
  • anthropic.claude-3-7-sonnet-20250219-v1:0
  • anthropic.claude-3-haiku-20240307-v1:0
  • anthropic.claude-3-5-haiku-20241022-v1:0
  • meta.llama3-1-70b-instruct-v1:0
  • mistral.mistral-large-2402-v1:0

Three important observations:

  1. Sonnet 4.6 isn't on the list. Neither is Sonnet 4.5 or Opus 4.x. The allowlist runs two generations behind the state of the art.
  2. The Bedrock console shows any available inference profile when picking a judge, including models that will later be rejected. Validation happens server-side in CreateEvaluationJob.
  3. Supported models can become unusable through disuse. If a model is marked Legacy and your account hasn't invoked it in 30 days, Bedrock denies it even though it's on the allowlist.

My fix: use amazon.nova-pro-v1:0 as the judge. Beyond being on the official list, it gave me something technically more defensible for the article: a cross-family judge (AWS Nova evaluating responses from Anthropic Sonnet 4.6), which reduces intra-family self-evaluation bias.

πŸŽ“ ProTip #7: Adopt cross-family judging as a pattern, not just because of AWS's limitations but because it's methodologically stronger. "Claude evaluating Claude" is a valid critique in academic papers. Nova evaluating Claude (or vice versa) eliminates that critique.

Gotcha #6: Why does the eval job fail with "metric name Builtin.ContextRelevance is not available"?

My next attempt, after fixing the judge:

```
ValidationException: The metric name Builtin.ContextRelevance is not available
for RAG retrieveAndGenerate evaluations.
```

Bedrock Evaluations splits built-in RAG metrics into two mutually exclusive sets depending on the job type:

| Metric | retrieveAndGenerate (end-to-end) | retrieve (retrieval only) |
| --- | --- | --- |
| Builtin.Correctness | ✅ | ❌ |
| Builtin.Completeness | ✅ | ❌ |
| Builtin.Helpfulness | ✅ | ❌ |
| Builtin.Faithfulness | ✅ | ❌ |
| Builtin.ContextRelevance | ❌ | ✅ |
| Builtin.ContextCoverage | ❌ | ✅ |

If you send a metric from the wrong set, the entire job fails, even if the other metrics do apply to the job type.

There's also an important nuance about retrieveAndGenerate: this job type produces scores that combine both retrieval and generation. That's why Correctness and Faithfulness can drop at the same time when retrieval fails (as we'll see in Observation 3). To isolate whether the problem is in retrieval or in the generator, you also need to run a retrieve-only job with ContextRelevance and ContextCoverage.

The official documentation does separate metrics by job type, but many examples and blogs list all 6 in the same list, which leads to the mistake.

πŸ’‘ ProTip #8: For a complete benchmark you need two jobs per KB: one retrieveAndGenerate with the 4 generation metrics, and another retrieve with the 2 retrieval metrics. That doubles the cost and time of evaluation. In this benchmark I ran only the end-to-end jobs; a follow-up would be running retrieve-only as well to get all 6 metrics.
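A tiny guard in your job-submission script catches this before CreateEvaluationJob does. The metric partition below is the one from the table above; the helper itself is a hypothetical sketch, not part of any AWS SDK:

```python
# Validate requested metrics against the evaluation job type before
# calling CreateEvaluationJob, which otherwise fails the entire job.
METRICS_BY_JOB_TYPE = {
    "retrieveAndGenerate": {
        "Builtin.Correctness", "Builtin.Completeness",
        "Builtin.Helpfulness", "Builtin.Faithfulness",
    },
    "retrieve": {"Builtin.ContextRelevance", "Builtin.ContextCoverage"},
}

def check_metrics(job_type: str, metrics: list[str]) -> list[str]:
    """Return the metrics invalid for this job type (empty list = OK)."""
    allowed = METRICS_BY_JOB_TYPE[job_type]
    return [m for m in metrics if m not in allowed]

bad = check_metrics("retrieveAndGenerate",
                    ["Builtin.Correctness", "Builtin.ContextRelevance"])
print(bad)  # ['Builtin.ContextRelevance']
```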

Gotcha #7: Why does Bedrock Evaluations say "does not have permission to call the KB API" even when the policies look correct?

Last gotcha. With everything above fixed, the eval jobs kept failing:

```
The provided role does not have permission to call the KB API.
```

The message makes you think it's a permissions policy issue. In reality it's two things:

  1. Trust policy: the aws:SourceArn must include the evaluation jobs pattern:

```json
"Condition": {
  "StringLike": {
    "aws:SourceArn": "arn:aws:bedrock:us-east-1:<account>:evaluation-job/*"
  }
}
```
  2. Permission policy: the ARNs of the KBs the job will query must be specific, not wildcarded:

```json
"Resource": [
  "arn:aws:bedrock:us-east-1:<account>:knowledge-base/<kb-id-1>",
  ...
]
```

Either one missing produces the same generic error. It sends you looking for the bug in the wrong place.

πŸ” ProTip #9: When Bedrock Evaluations tells you "does not have permission to call the KB API", always check both sides of IAM: trust policy AND permission policy. It's not the same as when other AWS services throw that error.

Adding Up the Gotchas

The 7 problems cost me several hours of debugging. All of them are fixable and all of them are resolved in the repository with the full Terraform code. But it's worth documenting them together because nobody had done it before and because anyone replicating this will trip over at least 3 of them.

Now, the benchmark numbers.

The Quantitative Results

25 questions with ground truth. 5 Knowledge Bases. 125 prompts to the generator (Claude Sonnet 4.6) and close to 500 judgments from the evaluator (Nova Pro). Scores are the per-metric average across the 25 questions:

Figure 1: Average scores per chunking strategy over 25 questions with ground truth. The "cliff" between the top group (Custom, Hierarchical, Fixed) and the bottom group (None, Semantic) is caused by ingestion failures, not by intrinsic chunking quality.

| Strategy | Correctness | Completeness | Helpfulness | Faithfulness |
| --- | --- | --- | --- | --- |
| custom | 0.940 | 0.790 | 0.873 | 0.820 |
| hierarchical | 0.920 | 0.750 | 0.887 | 0.810 |
| fixed | 0.880 | 0.760 | 0.880 | 0.810 |
| none | 0.261 | 0.210 | 0.710 | 0.228 |
| semantic | 0.160 | 0.104 | 0.580 | 0.140 |

Let me share five observations with the data in hand.

Observation 1: There Are Two Groups, Not a Continuous Ranking

Fixed, Hierarchical and Custom sit between 0.75 and 0.94 across all metrics. None and Semantic sit between 0.10 and 0.71. The Correctness gap between the third place of the top group (Fixed, 0.880) and the best of the bottom group (None, 0.261) is 0.619 points.

That doesn't get explained by statistical variance. It's a qualitative cut produced by the ingestion limits I documented above. The low scores for None and Semantic are not a judgment on those strategies' quality: they're the arithmetic consequence of not being able to index 2 out of 3 documents.

If you'd only looked at this table without the ingestion context, you'd have concluded that Semantic chunking is terrible. And that would be a false conclusion. What's terrible is trying to apply Semantic chunking to a corpus that exceeds its operational limit.

Observation 2: Among the 3 "Good" Strategies, the Margin Is Small

  • Custom wins 3 of 4 metrics (Correctness, Completeness, Faithfulness).
  • Hierarchical wins Helpfulness by 0.007 over Fixed (basically a tie).
  • Gap between first (Custom, 0.940) and third (Fixed, 0.880) on Correctness: 0.060.

A 0.06 margin is measurable but not overwhelming. My custom chunker (a markdown-aware recursive character splitter) is doing something useful, but it doesn't justify the operational cost of the Lambda for a generic corpus: Fixed gives you ~94% of the result without the 3 Lambda gotchas, without the Lambda cost, and without the extra debugging.

🎯 ProTip #10: A custom chunker is only worth it if you have a very specific document format where the generic chunker breaks domain-meaningful semantic units (source code, call transcripts, structured logs, contracts with numbered clauses). For standard technical documentation, Fixed wins by operational simplicity.
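For reference, the custom chunker's core idea is a recursive character splitter that prefers markdown boundaries. This is a simplified sketch of that idea, with my own separator list and packing heuristic, not the exact Lambda code from the repository:

```python
# Recursive character splitter that prefers markdown-aware boundaries:
# headings first, then paragraphs, lines, and words, with hard cuts last.
# Separators are dropped during splitting for brevity.
SEPARATORS = ["\n## ", "\n### ", "\n\n", "\n", " "]

def _pack(pieces: list[str], sep: str, max_chars: int) -> list[str]:
    """Greedily merge adjacent pieces back together up to max_chars."""
    packed, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                packed.append(current)
            current = piece
    if current:
        packed.append(current)
    return packed

def split_markdown(text: str, max_chars: int = 2000, depth: int = 0) -> list[str]:
    """Split at the highest-priority separator, recursing into big pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if depth >= len(SEPARATORS):
        # No separators left: hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep = SEPARATORS[depth]
    pieces = [p for p in text.split(sep) if p.strip()]
    chunks = []
    for piece in _pack(pieces, sep, max_chars):
        chunks.extend(split_markdown(piece, max_chars, depth + 1))
    return chunks

chunks = split_markdown("\n## Intro\n" + "word " * 600)
```

The payoff over plain fixed-size splitting is that headings and paragraphs rarely get cut in half; the cost is exactly the Lambda plumbing described in gotchas #2-4.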

Observation 3: Faithfulness Is the Most Discriminative Metric

Look at the difference between Correctness and Faithfulness for the strategies that failed:

| Strategy | Correctness | Faithfulness | Difference |
| --- | --- | --- | --- |
| none | 0.261 | 0.228 | -0.033 |
| semantic | 0.160 | 0.140 | -0.020 |

Faithfulness drops harder than Correctness when the KB doesn't have the content. Why? Because an answer can be correct without being grounded in the retrieved context.

When the KB doesn't have the relevant document indexed, Sonnet 4.6 still produces an answer using its parametric knowledge. If the answer happens to match ground truth, Correctness gives it a decent score. But Faithfulness measures whether the answer is supported by what the KB returned, and the KB didn't return anything useful. That's why Faithfulness collapses.

πŸ” ProTip #11: If you're diagnosing a RAG that appears to give correct but "suspicious" answers, Faithfulness is the metric that will confirm what you intuit. A Faithfulness drop is the earliest indicator that your KB isn't pulling the real context β€” more sensitive than Correctness.

Observation 4: SEMANTIC Ended Up Worse Than NONE (The Counterintuitive Analysis)

πŸ’‘ Key finding: When a chunking strategy can't ingest most of the corpus, fine chunking amplifies the noise of the little it did ingest. Absent chunking unifies it into a giant coherent chunk that's at least interpretable. This isn't a critique of SEMANTIC as a technique; it's a reminder that low scores aren't representative of the strategy in its proper use case.

This was the result that made me stop and think the most. Semantic should be at least as good as None: chunking "semantically" should be better than not chunking.

The data says otherwise. Across all 4 metrics, Semantic sits below None.

My hypothesis, after looking at the data:

Both strategies only managed to index the same file: the blog HTML (1.08 MB). But they do it in different ways:

  • NONE indexes that HTML as a single giant chunk of about 1 million characters. When retrieval matches on any question related to the blog's content, it retrieves the whole blog as context. Recall is perfect (all the content is there), even though the context is very noisy (most of the chunk doesn't apply to the question).

  • SEMANTIC subdivides that same HTML into smaller, more coherent chunks. For the ~20 benchmark questions whose topic isn't in the blog (but in the PDFs Semantic couldn't index), retrieval returns small chunks that are superficially relevant but empty of the content the question actually needs. The judge scores the answer as unfaithful (the retrieved context doesn't support it) and incorrect.

In other words: when your strategy can't ingest most of the corpus, fine chunking amplifies the noise of the little it did ingest. Absent chunking unifies it into a giant coherent chunk that is at least interpretable.

This isn't a critique of Semantic as a technique. It's an additional reminder that with a corpus the strategy can't process, no score will be good, and the low scores aren't representative of the strategy in its proper use case either.

Observation 5: Helpfulness Is the Least Useful Metric to Compare Chunking

Look at the range of Helpfulness across strategies:

  • custom: 0.873
  • hierarchical: 0.887
  • fixed: 0.880
  • none: 0.710
  • semantic: 0.580

The total range is 0.30 points. Compared to Correctness (range 0.78) and Faithfulness (range 0.68), Helpfulness barely differentiates. Even strategies that indexed almost nothing of the corpus scored between 0.58 and 0.71.

The judge seems to reward "the answer is well written, structured, and useful in itself," regardless of whether it's correct or faithful to the context. It's a metric of form more than substance.

πŸ’‘ ProTip #12: If you're going to pick 3 metrics to compare chunking strategies, pick Correctness, Faithfulness and Completeness in that order. Helpfulness is useful for measuring the quality of the generator, not of the chunking.

Decision Table: Which Strategy for Your Use Case?

After all this analysis, this is the recommendation I'd give someone today:

| Your use case | Recommended strategy | Reason |
| --- | --- | --- |
| Technical documentation (whitepapers, developer guides, corporate manuals) | FIXED_SIZE (max_tokens=512, overlap=20%) | Ingests everything, high scores, minimal complexity. Covers 80% of cases. |
| Documents with strongly marked hierarchy (books with chapters/sections, API documentation) | HIERARCHICAL | Uses the document's real structure. Small but measurable margin over FIXED_SIZE. |
| Pre-chunked corpus (each file is an FAQ, a ticket, a product) | NONE | Only legitimate case. Each file must be < 50,000 characters. |
| Corpus of articles/emails/short blogs (each file < 1 MB) | SEMANTIC | Preserves natural semantic boundaries. Only if all your files are small. |
| Very specific format (source code, transcripts, structured logs) | CUSTOM (Lambda) | When the generic chunker breaks domain-meaningful semantic units. Make sure you have debugging budget. |
| Not sure | FIXED_SIZE | Seriously. Start here. Measure. Change later if the data justifies the change. |

My Personal Recommendation

If I had to build a production RAG with Bedrock Knowledge Bases tomorrow, I'd start with this configuration:

  • Chunking: FIXED_SIZE, max_tokens=512, overlap=20%
  • Backend: S3 Vectors (unless I need hybrid search)
  • Embedding: Titan Text Embeddings v2, 1024 dimensions
  • Generator: Claude Sonnet 4.6 via inference profile
  • Evaluation: periodic jobs with Nova Pro as judge (cross-family)

And I'd measure Faithfulness and Correctness on a set of ground-truth questions from day 1. I'd only consider moving to Hierarchical or Custom if the numbers showed a specific gap justifying the added complexity.
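To make the recommended configuration concrete, here's a rough approximation of what FIXED_SIZE with max_tokens=512 and 20% overlap produces, using a crude ~4-characters-per-token estimate; Bedrock's actual tokenization will differ:

```python
# Approximate FIXED_SIZE chunking: windows of ~512 tokens, 20% overlap.
# The 4-chars-per-token ratio is a rough illustration, not Bedrock's tokenizer.
MAX_TOKENS = 512
OVERLAP_PCT = 20
CHARS_PER_TOKEN = 4

def fixed_size_chunks(text: str) -> list[str]:
    window = MAX_TOKENS * CHARS_PER_TOKEN              # ~2048 chars per chunk
    step = window - (window * OVERLAP_PCT // 100)      # advance ~1639 chars
    return [text[i:i + window] for i in range(0, len(text), step)]

doc = "x" * 10_000
chunks = fixed_size_chunks(doc)
print(len(chunks), len(chunks[0]))  # 7 2048
```

Each chunk repeats the last ~20% of the previous one, which is the cushion that keeps answers intact when a relevant sentence straddles a boundary.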

Chunking sometimes gets sold as the big lever in RAG. The reality is that what moves the needle most is:

  1. That your strategy can ingest your corpus without manual preprocessing.
  2. That you have a way to measure that it's working.
  3. That you can iterate on that measurement.

Everything else is fine-tuning.

What's Left

This benchmark has a deliberately narrow scope. Possible next steps:

  • Retrieval-only metrics (ContextRelevance, ContextCoverage) with a second set of eval jobs. I left them out because of the metric partition (gotcha #6).
  • Parameter grid search within each strategy. What happens if Fixed uses max_tokens=1024 instead of 512? How much does overlap move the needle?
  • Spanish-language corpus. This benchmark used English documentation. Titan v2 is multilingual, but it would be worth verifying whether the qualitative cut is the same in other languages.
  • Per-query production cost under realistic traffic patterns. This benchmark measures quality; real-time operational cost deserves its own analysis.

If any of these topics interests you or you'd like to see one covered in a follow-up article, leave me a comment. And if you replicate this benchmark in your own account and find more gotchas or better results, I'd love to hear about it.

Conclusion

Building this benchmark changed how I think about chunking in Bedrock Knowledge Bases. Not because I discovered that one strategy or another is "best", but because it became clear to me that the normal discussion about chunking has the wrong order.

First it matters whether your strategy can ingest your corpus. Then it matters whether your infrastructure is configured correctly. Then it matters to have objective metrics to compare. And only at the end, much later, does the nuance matter of which strategy has 0.06 points more than another on a specific metric.

If this article saves you an afternoon of debugging with infrastructure gotchas, it makes my day. If it helps you make an architecture decision with evidence instead of intuition, even better.

The full benchmark code (Terraform + Python + evaluation questions) is at github.com/codecr/bedrock-chunking-benchmark. Anyone can reproduce the results in their own account for about 18-20 USD, thanks to the near-zero cost of S3 Vectors as a backend.

πŸš€ Final Pro Tip: If you're going to take a RAG to production, invest time in evaluation before investing time in chunking. A "mediocre" chunking strategy with good evaluation will take you further than the "best" strategy with no way to measure whether it's working.

If you want to dig deeper into related Bedrock capabilities, I invite you to read my articles on Bedrock Evaluations and Bedrock Guardrails, which pair well with this analysis.

See you in the next article! Don't forget to share in the comments if you've had similar experiences configuring Knowledge Bases in production, or if you have questions about any of the findings. Happy building! πŸš€
