A Practical Engineering Guide to Cleaning, Semantic Chunking, Metadata, and Batch Embeddings
Large Language Model (LLM) knowledge bases are often misunderstood as simply “vectorizing documents.” In reality, a production-grade knowledge system is a retrieval infrastructure that must be traceable, incremental, and measurable.
This article walks through a practical engineering pipeline covering:
- Data cleaning and normalization
- Semantic chunking strategies
- Metadata schema design
- Batch embedding architecture
- Retrieval and evaluation considerations
The focus is not theory, but implementation decisions that work in real systems.
1. System Architecture Overview
Before implementation, define the boundaries of your pipeline. A robust LLM knowledge base usually consists of the following stages:
Ingest → Normalize → Chunk → Enrich → Embed → Index → Retrieve → Monitor
Core Responsibilities
- Ingest: PDFs, web pages, Markdown, databases, or internal docs
- Normalize: Convert raw content into structured blocks
- Chunk: Create retrieval-ready units
- Enrich: Attach metadata and context
- Embed: Generate vectors with version control
- Index: Build hybrid search indexes
- Serve: Retrieval + reranking + citation
- Monitor: Evaluate retrieval quality continuously
A knowledge base is closer to a search engine than a simple storage system.
2. Data Cleaning and Normalization
The goal is not to “clean aggressively,” but to preserve structural signals.
Required Processing
- Convert all content to UTF-8
- Normalize whitespace and line breaks
- Remove duplicated navigation/footer content
- Detect headings (H1/H2/H3 or numeric sections)
- Preserve structural blocks:
- Paragraphs
- Lists
- Tables
- Code blocks
Avoid flattening everything into plain text. Structure improves both retrieval accuracy and traceability.
Common Noise Sources
- Web navigation bars and cookie banners
- PDF headers and repeated page numbers
- Hyphenated line breaks in scanned PDFs
- Template content repeated across pages
Tables should ideally be converted into Markdown or key: value rows so that LLMs can interpret them correctly.
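The cleaning steps above can be sketched in a few lines. This is a minimal illustration, not a full normalizer; the regex patterns shown are common heuristics for hyphenated line breaks and whitespace noise, and a real pipeline would add structure detection on top:

```python
import re

def normalize_text(raw: str) -> str:
    """Minimal cleanup sketch: UTF-8 text in, normalized text out."""
    # Join words hyphenated across line breaks (common in scanned PDFs).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces/tabs, but preserve paragraph breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Heading detection and block extraction would run after this pass, so that paragraph and list boundaries survive.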
3. Semantic Chunking Strategy
Chunking is the most important factor affecting retrieval performance.
Chunking Goals
A good chunk should be:
- Self-contained: understandable without large context
- Traceable: linked back to its original location
- Searchable: not too long or too fragmented
Recommended Hierarchical Approach
- Structure-aware splitting (preferred):
  - Split by document headings first
  - Merge paragraphs inside each section
- Recursive splitting:
  - Paragraph → line → sentence → token boundary
- Semantic boundary detection (advanced):
  - Use topic shifts or embeddings to find natural breaks
Chunk Size and Overlap
Typical engineering defaults:
- FAQ or policies: 200–450 tokens, overlap 30–80
- Technical docs: 300–700 tokens, overlap 50–120
- Long reports or research: 400–900 tokens, overlap 80–150
Overlap prevents losing context when answers span boundaries.
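A sliding-window splitter with overlap can be sketched as follows. The defaults here match the "technical docs" range above; tokenization itself is assumed to have happened upstream:

```python
def split_with_overlap(tokens: list[str], size: int = 400,
                       overlap: int = 80) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by size - overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

Each chunk repeats the last `overlap` tokens of its predecessor, so an answer that straddles a boundary is still fully contained in at least one chunk.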
Parent–Child Chunk Design
A highly effective production pattern:
- Child chunks: smaller pieces used for vector retrieval
- Parent chunks: larger contextual sections passed to the LLM
Workflow:
- Retrieve child chunks
- Expand to parent chunks
- Send parents to the model for generation
This significantly improves answer coherence.
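The expansion step can be sketched with plain dictionaries; in production the child hits would come from a vector index and the parents from a document store:

```python
def expand_to_parents(child_hits: list[dict], parent_store: dict) -> list[str]:
    """Deduplicate parent_ids from ranked child hits, return parent texts."""
    seen, parents = set(), []
    for hit in child_hits:  # hits assumed sorted by similarity score
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(parent_store[pid])
    return parents
```

Deduplication matters here: several child chunks often share one parent section, and sending that section twice wastes context window.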
4. Metadata Schema Design
Metadata is not optional. It enables filtering, access control, versioning, and debugging.
Minimum Viable Metadata
Each chunk should include:
- doc_id
- chunk_id
- title
- section_path
- source_uri
- page_start / page_end
- created_at / updated_at
- language
- hash (content checksum)
- tenant / project
- acl (access control)
Enhanced Metadata (Recommended)
- doc_version
- effective_date
- tags
- entities (product names, systems, people)
- content_type (faq, guide, spec, code)
- parent_id
- quality_flags
These fields enable advanced filtering and evaluation later.
Stable Chunk ID Strategy
Chunk IDs must remain stable across re-processing.
Example:
chunk_id = sha1(doc_id + doc_version + section_path + chunk_index + text_hash_prefix)
Only changed content should produce new IDs.
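The ID formula above translates directly into code. The separator and the 8-character hash-prefix length are implementation choices, not requirements:

```python
import hashlib

def make_chunk_id(doc_id: str, doc_version: str, section_path: str,
                  chunk_index: int, text: str) -> str:
    """Deterministic chunk ID: stable unless the content itself changes."""
    text_hash_prefix = hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]
    key = f"{doc_id}|{doc_version}|{section_path}|{chunk_index}|{text_hash_prefix}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()
```

Because the ID is a pure function of its inputs, re-running the pipeline on unchanged content produces identical IDs, which is what makes incremental embedding possible.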
5. Batch Embedding Architecture
Embedding pipelines must be idempotent, incremental, and observable.
Suggested Data Model
documents
- doc_id, version, uri, title, checksum
chunks
- chunk_id, doc_id, text, metadata_json, hash
embeddings
- chunk_id, model_name, dim, vector, text_hash
embedding_jobs
- job_id, status, created_at
embedding_job_items
- job_id, chunk_id, retry_count, error
Key Engineering Practices
- Only embed chunks whose hash changed
- Process in batches (32–256 chunks or token-limited)
- Control concurrency to avoid rate limits
- Implement exponential retry
- Monitor throughput and failure rates
Supporting Multiple Models
Embedding records must include:
- model_name
- model_version
- vector_dimension
- normalized_flag
Allow multiple embeddings per chunk for gradual migration between models.
6. Retrieval Design: Hybrid Search and Reranking
Vector search alone is rarely sufficient.
Recommended Retrieval Pipeline
- Hybrid retrieval:
  - Vector similarity
  - BM25 keyword search
- Metadata filtering:
  - tenant / project
  - ACL
  - document type
- Reranking:
  - Lightweight reranker or LLM scoring
- Source citation:
  - Return source_uri + section_path + page
Hybrid search dramatically improves precision for exact terms and technical names.
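One common way to merge the vector and BM25 result lists is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. A minimal sketch, with the conventional k = 60:

```python
def rrf_fuse(vector_ranked: list[str], bm25_ranked: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over two ranked lists of chunk IDs."""
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, bm25_ranked):
        for rank, chunk_id in enumerate(ranked):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is robust precisely because BM25 scores and cosine similarities live on different scales; rank-based fusion sidesteps any score calibration.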
7. Chunk Quality Monitoring
Many production issues are caused by poor chunks rather than model failures.
Common anti-patterns:
- Chunks shorter than 50 tokens
- Chunks longer than 1200 tokens
- Repeated template content
- Missing title context
- Duplicate sections occupying top results
Add a simple rule engine that tags chunks with quality_flags.
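Such a rule engine can start as a single function that maps each chunk to its flags. Whitespace word count is used here as a crude token proxy; a real implementation would use the embedding model's tokenizer:

```python
def quality_flags(chunk: dict) -> list[str]:
    """Tag a chunk with simple rule-based quality flags."""
    flags = []
    n_tokens = len(chunk["text"].split())  # crude token proxy
    if n_tokens < 50:
        flags.append("too_short")
    if n_tokens > 1200:
        flags.append("too_long")
    if not chunk.get("title"):
        flags.append("missing_title")
    return flags
```

Flags stored in metadata can then drive both retrieval-time filtering and offline reports on which documents produce the worst chunks.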
8. End-to-End Processing Pipeline
A practical implementation roadmap:
- Ingest documents and generate doc_id
- Extract structured blocks
- Remove noise and duplicates
- Build parent chunks from sections
- Generate child chunks with overlap
- Attach metadata and hashes
- Upsert into the chunks table
- Create embedding jobs for new/changed chunks
- Batch embedding with workers
- Build vector and keyword indexes
- Run evaluation queries (golden dataset)
- Batch embedding with workers
- Build vector and keyword indexes
- Run evaluation queries (golden dataset)
Final Thoughts
Designing an LLM knowledge base is less about models and more about information architecture.
The biggest improvements usually come from:
- Better chunk structure
- Strong metadata design
- Incremental embedding pipelines
- Hybrid retrieval strategies
If you treat your knowledge base like a search system rather than a document dump, both retrieval accuracy and generation quality improve significantly.