A Practical Engineering Guide to Cleaning, Semantic Chunking, Metadata, and Batch Embeddings
Large Language Model (LLM) knowledge bases are often misunderstood as simply “vectorizing documents.” In reality, a production-grade knowledge system is a retrieval infrastructure that must be traceable, incremental, and measurable.
This article walks through a practical engineering pipeline covering:
- Data cleaning and normalization
- Semantic chunking strategies
- Metadata schema design
- Batch embedding architecture
- Retrieval and evaluation considerations
The focus is not theory, but implementation decisions that work in real systems.
1. System Architecture Overview
Before implementation, define the boundaries of your pipeline. A robust LLM knowledge base usually consists of the following stages:
Ingest → Normalize → Chunk → Enrich → Embed → Index → Retrieve → Monitor
Core Responsibilities
- Ingest: PDFs, web pages, Markdown, databases, or internal docs
- Normalize: Convert raw content into structured blocks
- Chunk: Create retrieval-ready units
- Enrich: Attach metadata and context
- Embed: Generate vectors with version control
- Index: Build hybrid search indexes
- Serve: Retrieval + reranking + citation
- Monitor: Evaluate retrieval quality continuously
A knowledge base is closer to a search engine than a simple storage system.
2. Data Cleaning and Normalization
The goal is not to “clean aggressively,” but to preserve structural signals.
Required Processing
- Convert all content to UTF-8
- Normalize whitespace and line breaks
- Remove duplicated navigation/footer content
- Detect headings (H1/H2/H3 or numeric sections)
- Preserve structural blocks:
- Paragraphs
- Lists
- Tables
- Code blocks
Avoid flattening everything into plain text. Structure improves both retrieval accuracy and traceability.
Common Noise Sources
- Web navigation bars and cookie banners
- PDF headers and repeated page numbers
- Hyphenated line breaks in scanned PDFs
- Template content repeated across pages
Tables should ideally be converted into Markdown or key: value rows so that LLMs can interpret them correctly.
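The cleaning steps above can be sketched in a few lines. This is a minimal illustration, not a full normalizer; the regex patterns shown are common heuristics for hyphenated line breaks and whitespace noise, and a real pipeline would add structure detection on top:

```python
import re

def normalize_text(raw: str) -> str:
    """Minimal cleanup sketch: UTF-8 text in, normalized text out."""
    # Join words hyphenated across line breaks (common in scanned PDFs).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces/tabs, but preserve paragraph breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Heading detection and block extraction would run after this pass, so that paragraph and list boundaries survive.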
3. Semantic Chunking Strategy
Chunking is the most important factor affecting retrieval performance.
Chunking Goals
A good chunk should be:
- Self-contained: understandable without large context
- Traceable: linked back to its original location
- Searchable: not too long or too fragmented
Recommended Hierarchical Approach
- Structure-aware splitting (preferred):
  - Split by document headings first
  - Merge paragraphs inside each section
- Recursive splitting:
  - Paragraph → line → sentence → token boundary
- Semantic boundary detection (advanced):
  - Use topic shifts or embeddings to find natural breaks
Chunk Size and Overlap
Typical engineering defaults:
- FAQ or policies: 200–450 tokens, overlap 30–80
- Technical docs: 300–700 tokens, overlap 50–120
- Long reports or research: 400–900 tokens, overlap 80–150
Overlap prevents losing context when answers span boundaries.
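A sliding-window splitter with overlap can be sketched as follows. The defaults here match the "technical docs" range above; tokenization itself is assumed to have happened upstream:

```python
def split_with_overlap(tokens: list[str], size: int = 400,
                       overlap: int = 80) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by size - overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

Each chunk repeats the last `overlap` tokens of its predecessor, so an answer that straddles a boundary is still fully contained in at least one chunk.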
Parent–Child Chunk Design
A highly effective production pattern:
- Child chunks: smaller pieces used for vector retrieval
- Parent chunks: larger contextual sections passed to the LLM
Workflow:
- Retrieve child chunks
- Expand to parent chunks
- Send parents to the model for generation
This significantly improves answer coherence.
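The expansion step can be sketched with plain dictionaries; in production the child hits would come from a vector index and the parents from a document store:

```python
def expand_to_parents(child_hits: list[dict], parent_store: dict) -> list[str]:
    """Deduplicate parent_ids from ranked child hits, return parent texts."""
    seen, parents = set(), []
    for hit in child_hits:  # hits assumed sorted by similarity score
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(parent_store[pid])
    return parents
```

Deduplication matters here: several child chunks often share one parent section, and sending that section twice wastes context window.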
4. Metadata Schema Design
Metadata is not optional. It enables filtering, access control, versioning, and debugging.
Minimum Viable Metadata
Each chunk should include:
- doc_id
- chunk_id
- title
- section_path
- source_uri
- page_start / page_end
- created_at / updated_at
- language
- hash (content checksum)
- tenant / project
- acl (access control)
Enhanced Metadata (Recommended)
- doc_version
- effective_date
- tags
- entities (product names, systems, people)
- content_type (faq, guide, spec, code)
- parent_id
- quality_flags
These fields enable advanced filtering and evaluation later.
Stable Chunk ID Strategy
Chunk IDs must remain stable across re-processing.
Example:
chunk_id = sha1(doc_id + doc_version + section_path + chunk_index + text_hash_prefix)
Only changed content should produce new IDs.
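The ID formula above translates directly into code. The separator and the 8-character hash-prefix length are implementation choices, not requirements:

```python
import hashlib

def make_chunk_id(doc_id: str, doc_version: str, section_path: str,
                  chunk_index: int, text: str) -> str:
    """Deterministic chunk ID: stable unless the content itself changes."""
    text_hash_prefix = hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]
    key = f"{doc_id}|{doc_version}|{section_path}|{chunk_index}|{text_hash_prefix}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()
```

Because the ID is a pure function of its inputs, re-running the pipeline on unchanged content produces identical IDs, which is what makes incremental embedding possible.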
5. Batch Embedding Architecture
Embedding pipelines must be idempotent, incremental, and observable.
Suggested Data Model
documents
- doc_id, version, uri, title, checksum
chunks
- chunk_id, doc_id, text, metadata_json, hash
embeddings
- chunk_id, model_name, dim, vector, text_hash
embedding_jobs
- job_id, status, created_at
embedding_job_items
- job_id, chunk_id, retry_count, error
Key Engineering Practices
- Only embed chunks whose hash changed
- Process in batches (32–256 chunks or token-limited)
- Control concurrency to avoid rate limits
- Implement exponential retry
- Monitor throughput and failure rates
Supporting Multiple Models
Embedding records must include:
- model_name
- model_version
- vector_dimension
- normalized_flag
Allow multiple embeddings per chunk for gradual migration between models.
6. Retrieval Design: Hybrid Search and Reranking
Vector search alone is rarely sufficient.
Recommended Retrieval Pipeline
- Hybrid retrieval:
  - Vector similarity
  - BM25 keyword search
- Metadata filtering:
  - tenant / project
  - ACL
  - document type
- Reranking:
  - Lightweight reranker or LLM scoring
- Source citation:
  - Return source_uri + section_path + page
Hybrid search dramatically improves precision for exact terms and technical names.
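One common way to merge the vector and BM25 result lists is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. A minimal sketch, with the conventional k = 60:

```python
def rrf_fuse(vector_ranked: list[str], bm25_ranked: list[str],
             k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over two ranked lists of chunk IDs."""
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, bm25_ranked):
        for rank, chunk_id in enumerate(ranked):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is robust precisely because BM25 scores and cosine similarities live on different scales; rank-based fusion sidesteps any score calibration.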
7. Chunk Quality Monitoring
Many production issues are caused by poor chunks rather than model failures.
Common anti-patterns:
- Chunks shorter than 50 tokens
- Chunks longer than 1200 tokens
- Repeated template content
- Missing title context
- Duplicate sections occupying top results
Add a simple rule engine that tags chunks with quality_flags.
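Such a rule engine can start as a single function that maps each chunk to its flags. Whitespace word count is used here as a crude token proxy; a real implementation would use the embedding model's tokenizer:

```python
def quality_flags(chunk: dict) -> list[str]:
    """Tag a chunk with simple rule-based quality flags."""
    flags = []
    n_tokens = len(chunk["text"].split())  # crude token proxy
    if n_tokens < 50:
        flags.append("too_short")
    if n_tokens > 1200:
        flags.append("too_long")
    if not chunk.get("title"):
        flags.append("missing_title")
    return flags
```

Flags stored in metadata can then drive both retrieval-time filtering and offline reports on which documents produce the worst chunks.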
8. End-to-End Processing Pipeline
A practical implementation roadmap:
- Ingest documents and generate doc_id
- Extract structured blocks
- Remove noise and duplicates
- Build parent chunks from sections
- Generate child chunks with overlap
- Attach metadata and hashes
- Upsert into the chunks table
- Create embedding jobs for new/changed chunks
- Batch embedding with workers
- Build vector and keyword indexes
- Run evaluation queries (golden dataset)
- Batch embedding with workers
- Build vector and keyword indexes
- Run evaluation queries (golden dataset)
Final Thoughts
Designing an LLM knowledge base is less about models and more about information architecture.
The biggest improvements usually come from:
- Better chunk structure
- Strong metadata design
- Incremental embedding pipelines
- Hybrid retrieval strategies
If you treat your knowledge base like a search system rather than a document dump, both retrieval accuracy and generation quality improve significantly.