James Lee

Posted on Jun 18

Part 2 — Why Does One System Need Three Chunking Strategies? And One Document Type Shouldn't Be Chunked At All

#rag #ai #llm #architecture

This article covers the second layer of the full-stack architecture: the Chunking Service. Chunking strategy sets the ceiling for retrieval quality — no matter how good upstream parsing is, if chunking is wrong, nothing downstream can fix it. Core engineering insight: chunking is not a parameter tuning problem. It's a judgment problem about what constitutes the minimum semantic unit of a document.

📦 Source code: production-rag-engineering — esg/services/chunking_service.py

0. The Pain Point

The first version of the system used a single chunking strategy across all documents:

Fixed 512-character chunks for everything.

Miss rate: 15%. Retrieval kept returning semantically incomplete chunks — the content was there, but only half of it.

The first instinct was to tune the parameter: 512 is too small, try 1024? Try 2048?

After a full round of testing, miss rate dropped from 15% to 12% — and then plateaued. No further improvement.

The problem wasn't the parameter. It was using the same ruler to measure two completely different things.

GRI clauses and ESG reports are both PDF text on the surface, but their semantic structures are fundamentally different. Applying the same chunking strategy to both is like using a bread knife to cut tofu — the knife isn't too dull, it's just the wrong tool entirely.

1. What Chunking Actually Needs to Solve

Start by understanding the essential difference between the two document types:

Dimension	Long-form mixed documents (ESG reports)	Structured rule documents (GRI clauses)
Volume	50,000–100,000 words, 200–300 pages	100–500 words per clause
Format	Text + tables + charts mixed	Pure text, clear paragraph boundaries
Semantic unit	One logical point may span multiple paragraphs	Each entry is a complete, self-contained semantic unit
Chunking risk	Too small = cross-paragraph logic gets cut	Truncation = error. There is no "partially correct."

This distinction isn't unique to ESG. Legal statutes vs. case materials. API documentation vs. user feedback. Medical guidelines vs. clinical records. Any system that simultaneously processes "rule documents" and "application documents" will hit the same problem.

Core judgment:

Long-form mixed documents → semantic unit is "paragraph-level logic" → needs fixed size + overlap to preserve context
Structured rule documents → semantic unit is the entry itself → truncation destroys it → fixed chunking is the wrong approach entirely

2. How the Parameters Were Determined

Before finalizing the parameters, we ran a systematic controlled test.

Test set: 100 ESG reports, 500 GRI clauses, covering manufacturing, financial services, and energy sectors.

chunk_size comparison (fixed 200-character overlap):

chunk_size	Recall rate	Primary issue
256 chars	85%	Semantic fragmentation — single logical points split into 3–4 chunks
512 chars	92%	Good coverage for short paragraphs; long logic (e.g., Scope 3's 11 categories) still truncated
1024 chars	91%	Too much irrelevant context included; retrieval noise increases
2000 chars	92%	Covers 85% of long paragraphs completely; same recall as 512 but better completeness

Overlap length comparison (fixed chunk_size 2000 chars):

Overlap	Miss rate	Storage cost increase
100 chars	25%	+5%
200 chars	15%	+10%
300 chars	8%	+15%
500 chars	6%	+25%

300 characters is the point of diminishing returns: an 8% miss rate is acceptable, and storage cost grows by only 15%. Going from 300 to 500 cuts the miss rate by just 2 percentage points while adding another 10 points of storage cost, which is not worth it.
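The trade-off can be checked with a few lines of arithmetic. The sketch below takes the four rows of the overlap table and computes, for each step up in overlap, how many miss-rate points are removed per extra point of storage cost. The names `rows` and `marginal_gains` are illustrative and not taken from the repository.

```python
# Rows from the overlap table: (overlap_chars, miss_rate_pct, storage_increase_pct)
rows = [(100, 25, 5), (200, 15, 10), (300, 8, 15), (500, 6, 25)]

def marginal_gains(rows):
    """Miss-rate points removed per extra storage point, for each step up in overlap."""
    gains = []
    for (o1, m1, s1), (o2, m2, s2) in zip(rows, rows[1:]):
        gains.append((o1, o2, (m1 - m2) / (s2 - s1)))
    return gains

for lo, hi, gain in marginal_gains(rows):
    print(f"{lo} -> {hi} chars: {gain:.1f} miss-rate points per storage point")
```

The return falls from 2.0 and 1.4 on the first two steps to 0.2 on the last one, which is the numerical reason for stopping at 300 characters.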

Why does 300-character overlap cover most cross-chunk information?

We measured the length of key information that spans chunk boundaries across 100 reports. It clusters between 100 and 120 characters, so a 300-character overlap leaves a wide margin and covers 95% of cross-boundary descriptions. The remaining 5% is handled by the merge mechanism downstream.

3. Three-Strategy Routing Logic

The final solution uses three strategies, routed by document type:

Document enters Chunking Service
        ↓
[Document Classification] — 3-factor judgment (95% accuracy)
  ├─ Factor 1: Filename (contains "Annual Report / ESG Report" → report type)
  ├─ Factor 2: Page count (> 50 pages → long-form)
  └─ Factor 3: Content features (table density / section headings / domain terms)
        ↓
┌──────────────────────────────────────────────────────┐
│ Type A: Long-form mixed documents (annual ESG reports)│
│ → chunk_size=2000 chars, overlap=300 chars           │
├──────────────────────────────────────────────────────┤
│ Type B: High-density table documents (carbon reports) │
│ → chunk_size=3000 chars, no overlap                  │
├──────────────────────────────────────────────────────┤
│ Type C: Structured rule documents (GRI clauses)       │
│ → Paragraph chunking, each entry becomes its own chunk│
└──────────────────────────────────────────────────────┘

Why does Type B use 3000 characters with no overlap?

Carbon footprint reports are predominantly tables, and a table is itself a complete semantic unit. Overlap would copy the last few rows of one table into the beginning of the next chunk, so retrieval would surface two chunks containing the same table fragment and introduce noise. The 3000-character window is sized so that a complete table fits inside a single chunk instead of being split across two.
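As a rough illustration of how the Type A and Type B settings differ in practice, the sketch below stores them in a lookup table and applies them with a plain fixed-size splitter. `ChunkProfile`, `PROFILES`, and `fixed_chunks` are hypothetical names, and the splitter leaves out the anti-truncation logic described later.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkProfile:
    chunk_size: int  # characters per chunk
    overlap: int     # characters shared with the previous chunk

# Values from the routing diagram. Type C is paragraph-based and has no profile.
PROFILES = {
    "A": ChunkProfile(chunk_size=2000, overlap=300),
    "B": ChunkProfile(chunk_size=3000, overlap=0),
}

def fixed_chunks(text: str, profile: ChunkProfile) -> list[str]:
    """Split text into fixed-size windows, stepping by chunk_size - overlap."""
    step = profile.chunk_size - profile.overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + profile.chunk_size])
        if start + profile.chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With `overlap=0` the chunks partition the text, so no table row is ever stored twice. With `overlap=300` the last 300 characters of each chunk reappear at the start of the next one.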

How was 95% document classification accuracy achieved?

Three-factor joint judgment — single-factor classification is error-prone:

Filename only: some reports are named "2023_ESG.pdf" with no explicit type indicator
Page count only: GRI standard documents can also exceed 50 pages
Three factors combined: filename + page count + table density and domain terms from the first 3 pages → misclassification rate drops from 15% to 5%
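One way to combine the three factors is a small scoring function. This is a sketch under stated assumptions, not the repository's classifier: the `DocFeatures` fields, the cutoffs (5 clause markers, 0.6 table density), and the double weight on content features are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DocFeatures:
    filename: str
    page_count: int
    table_density: float  # share of the first 3 pages occupied by tables, 0.0 to 1.0
    clause_markers: int   # headings that look like clause numbers, e.g. "305-1"

REPORT_NAME_HINTS = ("annual report", "esg report")  # Factor 1 keywords

def classify(doc: DocFeatures) -> str:
    """Return "A" (long-form report), "B" (table-heavy) or "C" (rule document)."""
    scores = {"A": 0, "B": 0, "C": 0}
    name = doc.filename.lower().replace("_", " ")
    # Factor 1: filename
    if any(hint in name for hint in REPORT_NAME_HINTS):
        scores["A"] += 1
    # Factor 2: page count
    if doc.page_count > 50:
        scores["A"] += 1
    # Factor 3: content features, weighted double (hypothetical weights and cutoffs)
    if doc.clause_markers >= 5:
        scores["C"] += 2
    elif doc.table_density >= 0.6:
        scores["B"] += 2
    else:
        scores["A"] += 1
    # On a tie, max() returns the first key in insertion order, so "A" wins.
    return max(scores, key=scores.get)
```

A file named "2023_ESG.pdf" gets no filename vote here, but its page count and low table density still route it to Type A. A 60-page GRI standard gets one vote from its page count, which the clause markers outweigh.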

4. Atomic Semantic Units: Why Rule Documents Cannot Be Fixed-Chunked

This is the most important engineering insight in this article. It deserves its own section.

What is an atomic semantic unit?

The complete content of GRI clause 306-3 is:

The organization shall report: (a) the total number and total volume of significant spills; (b) information about significant spills by type: spills on land, spills into bodies of water, spills into groundwater; (c) impacts of significant spills that are recorded in the organization's operational impact assessments; (d) actions taken by the organization to address the consequences of significant spills.

These four elements — count + volume, classification, impact, remediation — form a single whole. If any one is truncated, the system cannot determine whether the company has fully disclosed clause 306-3.

Truncation equals error. There is no "partially correct."

This is why rule documents cannot be fixed-chunked. Fixed chunking operates on the logic of "cut by length." Rule documents require the logic of "cut by entry."

Paragraph boundary detection: rules + model, two layers

Newline-based rule detection → identifies 80% of obvious boundaries (speed: 1s/document)
        ↓
BGE-M3 semantic model → identifies remaining 20% of implicit boundaries
(sudden drop in semantic similarity = logical transition = chunk boundary)
        ↓
Combined accuracy: 95% (10% better than rules alone, 3x faster than model alone)
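The two layers can be sketched as follows. To keep the example self-contained, a bag-of-words vector stands in for BGE-M3, and a fixed similarity floor stands in for "sudden drop" detection. Both are simplifications, and `toy_embed`, `detect_units`, and `min_similarity` are hypothetical names.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in for a real embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def detect_units(text, embed=toy_embed, min_similarity=0.2):
    """Layer 1 splits on blank lines; layer 2 splits each block where
    adjacent sentences fall below the similarity floor."""
    units = []
    for block in re.split(r"\n\s*\n", text.strip()):  # layer 1: newline rule
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", block.strip()) if s]
        if not sentences:
            continue
        current = [sentences[0]]
        for prev, nxt in zip(sentences, sentences[1:]):  # layer 2: similarity check
            if cosine(embed(prev), embed(nxt)) < min_similarity:
                units.append(" ".join(current))
                current = []
            current.append(nxt)
        units.append(" ".join(current))
    return units
```

Passing a real embedding function as `embed` would require a matching `cosine` for dense vectors. The control flow stays the same.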

How are clauses longer than 1000 characters handled?

A small number of GRI clauses exceed 1000 characters — typically those with extensive examples. Handling logic:

Use BGE-M3 to identify internal logical boundaries (e.g., the boundary between "requirements" and "examples")
Split at the boundary into sub-chunks, each still a complete logical unit
Sub-chunks are linked via parent_chunk_id to preserve their relationship

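def split_long_clause(text: str, max_size: int = 1000) -> list[dict]:
    """Split an over-long clause once, at its strongest internal logical boundary."""
    if len(text) <= max_size:
        return [{"text": text, "is_split": False}]

    # split_to_sentences and find_semantic_boundary are project helpers.
    # find_semantic_boundary is assumed to return the index of the sentence
    # that starts the second part (the point where BGE-M3 similarity drops).
    sentences = split_to_sentences(text)
    boundary = find_semantic_boundary(sentences)

    # Join whole sentences on each side of the boundary. Slicing the raw text
    # with a sentence index would cut at the wrong character position.
    first = " ".join(sentences[:boundary])
    second = " ".join(sentences[boundary:])

    return [
        {"text": first, "is_split": True, "part": 1},
        {"text": second, "is_split": True, "part": 2},
    ]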
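The parent_chunk_id link mentioned in the handling logic could look like the sketch below. The record fields and the `#part` ID format are assumptions made for illustration. Only `parent_chunk_id` itself comes from the article.

```python
def link_sub_chunks(parent_id: str, parts: list[str]) -> list[dict]:
    """Wrap split parts as chunk records that point back to their source clause."""
    return [
        {
            "chunk_id": f"{parent_id}#part{i}",  # hypothetical ID format
            "parent_chunk_id": parent_id,
            "part": i,
            "total_parts": len(parts),
            "text": part,
        }
        for i, part in enumerate(parts, start=1)
    ]

def reassemble(sub_chunks: list[dict]) -> str:
    """Rebuild the full clause from its sub-chunks, in part order."""
    ordered = sorted(sub_chunks, key=lambda c: c["part"])
    return "".join(c["text"] for c in ordered)
```

When retrieval hits one sub-chunk, looking up its siblings by `parent_chunk_id` and calling `reassemble` restores the complete clause before it is passed on.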

5. Anti-Truncation: Two-Layer Defense

Even with differentiated strategies, cross-chunk truncation still occurs in long-form reports. The defense has two layers: prevent upfront, repair after the fact.

Layer 1: Upfront prevention (active protection during chunking)

We maintain a library of 300+ domain terms (Scope 1/2/3, carbon intensity, biodiversity, GHG Protocol…). During chunking, the system checks whether any term falls on a chunk boundary:

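# Excerpt only; the full library holds 300+ terms
ESG_TERMS = ["Scope 3", "carbon intensity", "biodiversity", "GHG Protocol"]

def find_next_sentence_end(text: str, pos: int) -> int:
    """Minimal stand-in so this block runs; the repository's helper may differ.
    Returns the index just past the next period, or the end of the text."""
    idx = text.find(".", pos)
    return len(text) if idx == -1 else idx + 1

def safe_split(text: str, chunk_size: int, overlap: int) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))

        if end < len(text):
            # Check if any term is being truncated at the boundary
            for term in ESG_TERMS:
                # Any match inside this window starts before the cut and ends after it
                window = text[max(end - len(term) + 1, 0):end + len(term) - 1]
                if term in window:
                    # Shift split point forward to end of sentence
                    end = find_next_sentence_end(text, end)
                    break

        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk written; avoids a trailing chunk that is pure overlap
        start = end - overlap

    return chunks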

Layer 2: Post-hoc repair (automatic merge during retrieval)

Each chunk records its neighbor relationships at write time:

chunk_metadata = {
    "chunk_id": "chunk_245",
    "prev_chunk_id": "chunk_244",
    "next_chunk_id": "chunk_246",
    "page_range": "45-46",
    "similarity_score": None  # filled in at retrieval time
}
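A minimal sketch of how the neighbor links might be filled in at write time, assuming chunks arrive in document order. The function name `build_chunk_records` and the ID format are hypothetical. `page_range` is left out because it depends on parser output that is not shown here.

```python
def build_chunk_records(chunks: list[str], doc_id: str) -> list[dict]:
    """Assign sequential IDs and prev/next links to chunks in document order."""
    records = []
    last = len(chunks) - 1
    for i, text in enumerate(chunks):
        records.append({
            "chunk_id": f"{doc_id}_chunk_{i}",
            # The first chunk has no predecessor and the last has no successor.
            "prev_chunk_id": f"{doc_id}_chunk_{i - 1}" if i > 0 else None,
            "next_chunk_id": f"{doc_id}_chunk_{i + 1}" if i < last else None,
            "text": text,
            "similarity_score": None,  # filled in at retrieval time
        })
    return records
```

Storing the links as plain IDs keeps the write path cheap. The neighbor is fetched only when a chunk is actually retrieved.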

At retrieval time, if a retrieved chunk and its neighbor have semantic similarity ≥ 0.7, they are automatically merged into an expanded chunk:

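def expand_chunk(chunk: dict, threshold: float = 0.7) -> dict:
    """Merge a retrieved chunk with its next neighbor when the two are semantically continuous."""
    next_id = chunk.get("next_chunk_id")  # None for the last chunk of a document
    next_chunk = get_chunk(next_id) if next_id else None
    if next_chunk:
        similarity = cosine_similarity(
            chunk["embedding"],
            next_chunk["embedding"]
        )
        if similarity >= threshold:
            return merge_chunks(chunk, next_chunk)
    return chunk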

Why 0.7 as the threshold?

Calibrated against 500 reports:

Threshold 0.6: over-merges — pulls in unrelated adjacent chunks, introduces noise
Threshold 0.7: precise merging of semantically continuous chunks, false merge rate < 5%
Threshold 0.8: under-merges — cross-chunk descriptions like Scope 3 categories still get missed
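A calibration of this kind can be run as a threshold sweep over hand-labeled neighbor pairs. The sketch below shows the mechanics only. The eight pairs are made-up values for illustration, not the calibration data from the 500 reports, and `sweep` is a hypothetical name.

```python
def sweep(pairs, thresholds):
    """pairs: (similarity, should_merge) tuples labeled by hand.
    Returns the false-merge and missed-merge rates for each threshold."""
    positives = sum(1 for _, should_merge in pairs if should_merge)
    report = {}
    for t in thresholds:
        merged = [(s, y) for s, y in pairs if s >= t]
        false_merges = sum(1 for _, y in merged if not y)
        missed = sum(1 for s, y in pairs if y and s < t)
        report[t] = {
            "false_merge_rate": false_merges / len(merged) if merged else 0.0,
            "missed_merge_rate": missed / positives if positives else 0.0,
        }
    return report

# Synthetic example data (not from the article)
pairs = [(0.91, True), (0.84, True), (0.76, True), (0.72, True),
         (0.66, False), (0.63, False), (0.62, True), (0.55, False)]

for t, rates in sweep(pairs, [0.6, 0.7, 0.8]).items():
    print(t, rates)
```

On this toy set the pattern matches the description above: the lowest threshold lets unrelated neighbors through, and the highest one misses most of the pairs that should have been merged.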

Anti-truncation results: miss rate 30% → 8%, answer completeness 70% → 92%.

6. How the Two Chunking Strategies Work Together at Retrieval Time

The two chunking strategies don't operate independently — they have a clear division of labor at retrieval time.

Rule chunks serve as "standard anchors." Report chunks find the "corresponding content."

Example query: "Does this company comply with GRI 305-1 disclosure requirements?"

Step 1 — Standard anchoring (0.2s)
  Embed the query and search the GRI clause library
  → Matches GRI 305-1 paragraph chunk
  → Retrieves: "Must disclose: Scope 1 emissions + calculation method + data source"
  → This becomes the "reference vector" — tells the system what counts as satisfying the requirement

Step 2 — Content matching (0.5s)
  Use the 305-1 clause chunk embedding to search the ESG report vector store
  → Rank by similarity, retrieve Top 3
  → Chunk with similarity 0.85: "Scope 1 emissions: 5,000 tonnes, IPCC calculation method,
     data sourced from energy invoices"

Step 3 — Context expansion (0.3s)
  Follow next_chunk_id to adjacent chunk
  → Adjacent chunk similarity 0.82 ≥ 0.7, auto-merge
  → Adds "data verification process" content

Step 4 — Result synthesis (1.4s)
  Send "305-1 disclosure requirements" + "actual report content" to LLM
  → Output: "Scope 1 emissions, calculation method, and data source are all disclosed.
     Compliant with 305-1."

Step 5 — Metadata traceability (0.1s)
  Attached: source = 2023 ESG Report pp.45–46, chunk_id=chunk_245

Total latency: 2.5 seconds
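Steps 1 and 2 are the core of the anchoring idea: the report is searched with the clause embedding, not the raw query embedding. A minimal in-memory sketch, assuming each store is a list of dicts with `id` and `embedding` keys (a hypothetical layout; a real deployment would call a vector database instead):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k):
    """Return (score, item) pairs for the k most similar items, best first."""
    scored = [(cosine(query_vec, item["embedding"]), item) for item in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

def anchored_search(query_vec, clause_store, report_store, k=3):
    # Step 1: find the clause the question is about.
    _, clause = top_k(query_vec, clause_store, 1)[0]
    # Step 2: search the report with the clause embedding as the reference vector.
    hits = top_k(clause["embedding"], report_store, k)
    return clause, hits
```

Searching with the clause embedding means the report chunks are ranked against the full wording of the requirement, which is usually richer than the user's short question.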

Three-layer false positive filter:

The biggest risk in coordinated retrieval is "high similarity but semantically unrelated" — for example, "energy consumption" and "spill incidents" may be close in vector space but are completely unrelated in business terms.

Layer 1 — Keyword hard match
  When retrieving for GRI 305 (greenhouse gas emissions), retrieved chunks must contain
  at least 2 of: ["Scope 1", "Scope 2", "emissions", "calculation method"]
  → Filters out chunks with high similarity but mismatched keywords

Layer 2 — LLM semantic cross-validation
  For chunks passing Layer 1, ask the LLM:
  "Does this content actually answer the disclosure points required by the clause?"
  → Filters out chunks that "mention emissions but lack calculation methodology"

Layer 3 — Manual spot-check calibration
  Monthly spot-check of 100 retrieval results
  If false positive rate > 5%, trigger keyword library update or threshold adjustment

Three-layer filter results: false positive rate 15% → 3%, accuracy 70% → 91%.
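Layer 1 is the cheapest of the three filters and simple to sketch. The keyword list for GRI 305 is the one given above. The `REQUIRED_KEYWORDS` mapping, the function name, and the choice to let standards without a keyword list pass through are assumptions.

```python
REQUIRED_KEYWORDS = {
    "GRI 305": ["Scope 1", "Scope 2", "emissions", "calculation method"],
}

def passes_keyword_filter(chunk_text: str, standard: str, min_hits: int = 2) -> bool:
    """Keep a chunk only if it contains at least min_hits of the standard's keywords."""
    keywords = REQUIRED_KEYWORDS.get(standard, [])
    if not keywords:
        return True  # assumption: no keyword list means no hard filter
    text = chunk_text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits >= min_hits

candidates = [
    "Scope 1 emissions: 5,000 tonnes, IPCC calculation method, data sourced from energy invoices",
    "Total energy consumption rose 4% year on year.",
]
kept = [c for c in candidates if passes_keyword_filter(c, "GRI 305")]
```

Only the first candidate survives, so the LLM check in Layer 2 runs on fewer chunks.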

7. Why We Didn't Use Semantic Chunking

Semantic chunking is another common option — use a semantic model to compute sentence boundaries and split at points where similarity drops sharply.

We tested it. The conclusion: in structured document scenarios, the cost-benefit ratio isn't there.

Metric	Multi-strategy chunking	Semantic chunking
Recall rate	92%	94%
Processing speed	15ms/chunk	20ms/chunk (33% more time per chunk)
Cost per report	$0.50	$0.70 (40% more expensive)
Development complexity	Medium	High

Semantic chunking gains 2 percentage points of recall, but it costs 40% more and takes 33% longer per chunk.

Key judgment: GRI clauses are already structured text with clear paragraph boundaries. The rules + BGE-M3 hybrid approach already identifies 95% of boundaries correctly. Introducing full semantic chunking means paying 40% more for a 2-point recall gain. The ROI isn't there.

Semantic chunking's value becomes apparent when document structure is highly irregular — scanned PDFs, unformatted plain text. For well-structured documents, it's overkill.

8. Wrapping Up: The Chunking Decision Tree

When facing a new chunking scenario, two questions determine the strategy:

Q1: Is the document's minimum semantic unit atomic?
  ├─ Yes (each entry / clause / rule is independently complete)
  │   → Paragraph chunking. Do NOT use fixed chunking.
  │   → For entries > 1000 chars, do a secondary split and retain parent_chunk_id
  └─ No (semantic units span paragraphs, context is required)
      → Go to Q2

Q2: Is the document long-form mixed content (text + tables)?
  ├─ Yes
  │   → Fixed size + overlap (2000 chars + 300 char overlap)
  │   → Add anti-truncation term library + neighbor relationship tracking
  └─ No (predominantly high-density tables)
      → Fixed size + no overlap (3000 chars)
      → Tables are complete semantic units; overlap only introduces noise
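The same tree, written as a function for readers who prefer code. The parameter names and the shape of the returned dict are hypothetical. The numbers are the ones from the tree.

```python
def choose_strategy(is_atomic: bool, is_mixed_long_form: bool) -> dict:
    """Map the answers to Q1 and Q2 onto a chunking configuration."""
    if is_atomic:
        # Q1 = yes: one chunk per entry, secondary split only for very long entries
        return {"strategy": "paragraph", "secondary_split_over": 1000}
    if is_mixed_long_form:
        # Q2 = yes: fixed windows with overlap, plus anti-truncation safeguards
        return {"strategy": "fixed", "chunk_size": 2000, "overlap": 300}
    # Q2 = no: table-heavy document, larger windows and no overlap
    return {"strategy": "fixed", "chunk_size": 3000, "overlap": 0}
```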

Two things every strategy needs, regardless of type:

Each chunk records prev_chunk_id / next_chunk_id — enables merge expansion at retrieval time
Chunk metadata includes page_range and chunk_id — lays the foundation for full-chain traceability in Part 5

Source Code

All implementations referenced in this article are available here:

👉 github.com/muzinan123/production-rag-engineering

Relevant files for this part:

esg/services/chunking_service.py — 4 chunking strategies with document-type routing

Next up: Once documents are chunked and stored in the vector store, retrieval is where the real battle begins. General-purpose embedding models drift on domain-specific terminology — "Scope 1 emissions" and "direct greenhouse gas emissions" are far apart in vector space, but they refer to the same thing. Where exactly does vector retrieval break down? And how do you fix each failure point? → Part 3 — Retrieval
