yang yaru

Designing a Scalable Knowledge Base for Large Language Models

A Practical Engineering Guide to Cleaning, Semantic Chunking, Metadata, and Batch Embeddings

Large Language Model (LLM) knowledge bases are often misunderstood as simply “vectorizing documents.” In reality, a production-grade knowledge system is a retrieval infrastructure that must be traceable, incrementally updatable, and measurable.

This article walks through a practical engineering pipeline covering:

  • Data cleaning and normalization
  • Semantic chunking strategies
  • Metadata schema design
  • Batch embedding architecture
  • Retrieval and evaluation considerations

The focus is not theory, but implementation decisions that work in real systems.


1. System Architecture Overview

Before implementation, define the boundaries of your pipeline. A robust LLM knowledge base usually consists of the following stages:

Ingest → Normalize → Chunk → Enrich → Embed → Index → Retrieve → Monitor

Core Responsibilities

  • Ingest: PDFs, web pages, Markdown, databases, or internal docs
  • Normalize: Convert raw content into structured blocks
  • Chunk: Create retrieval-ready units
  • Enrich: Attach metadata and context
  • Embed: Generate vectors with version control
  • Index: Build hybrid search indexes
  • Retrieve: hybrid search + reranking + citation
  • Monitor: Evaluate retrieval quality continuously

A knowledge base is closer to a search engine than a simple storage system.


2. Data Cleaning and Normalization

The goal is not to “clean aggressively,” but to preserve structural signals.

Required Processing

  • Convert all content to UTF-8
  • Normalize whitespace and line breaks
  • Remove duplicated navigation/footer content
  • Detect headings (H1/H2/H3 or numeric sections)
  • Preserve structural blocks:

    • Paragraphs
    • Lists
    • Tables
    • Code blocks

Avoid flattening everything into plain text. Structure improves both retrieval accuracy and traceability.

Common Noise Sources

  • Web navigation bars and cookie banners
  • PDF headers and repeated page numbers
  • Hyphenated line breaks in scanned PDFs
  • Template content repeated across pages

Tables should ideally be converted into Markdown or key: value rows so that LLMs can interpret them correctly.
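As a minimal sketch of the normalization step (the function name and regex patterns are illustrative choices, not a complete cleaner):

import re
import unicodedata

def normalize_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode to UTF-8, normalize Unicode, and repair common PDF artifacts."""
    text = raw.decode(encoding, errors="replace")
    text = unicodedata.normalize("NFC", text)
    # Rejoin words hyphenated across line breaks (common in scanned PDFs)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs, but preserve paragraph breaks
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()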


3. Semantic Chunking Strategy

Chunking is the most important factor affecting retrieval performance.

Chunking Goals

A good chunk should be:

  • Self-contained: understandable without large context
  • Traceable: linked back to its original location
  • Searchable: not too long or too fragmented

Recommended Hierarchical Approach

  1. Structure-aware splitting (Preferred)
  • Split by document headings first
  • Merge paragraphs inside each section
  2. Recursive splitting
  • Paragraph → Line → Sentence → Token boundary
  3. Semantic boundary detection (Advanced)
  • Use topic shifts or embeddings to find natural breaks

Chunk Size and Overlap

Typical engineering defaults:

  • FAQ or policies: 200–450 tokens, overlap 30–80
  • Technical docs: 300–700 tokens, overlap 50–120
  • Long reports or research: 400–900 tokens, overlap 80–150

Overlap prevents losing context when answers span boundaries.
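As a rough illustration, here is a naive sliding-window splitter. Whitespace words stand in for real model tokens; production code would count tokens with the embedding model's tokenizer and split on structural boundaries first:

def chunk_with_overlap(text: str, max_tokens: int = 500, overlap: int = 80) -> list[str]:
    """Naive sliding-window chunker; whitespace words stand in for tokens."""
    words = text.split()
    step = max(1, max_tokens - overlap)  # guard against overlap >= max_tokens
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already reached the end of the text
    return chunks

With the defaults above, consecutive chunks share roughly 80 words of context, in line with the overlap ranges suggested for technical docs.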

Parent–Child Chunk Design

A highly effective production pattern:

  • Child chunks: smaller pieces used for vector retrieval
  • Parent chunks: larger contextual sections passed to the LLM

Workflow:

  1. Retrieve child chunks
  2. Expand to parent chunks
  3. Send parents to the model for generation

This significantly improves answer coherence.
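A minimal sketch of the expansion step; the child_hits and parent_store shapes are assumptions about your storage layer, not a fixed API:

def expand_to_parents(child_hits, parent_store, max_parents=3):
    """Expand ranked child hits to their parent sections, preserving order.

    child_hits: ranked list of dicts with a "parent_id" field (from vector
    search); parent_store: mapping parent_id -> parent chunk.
    """
    ordered_parent_ids = []
    for hit in child_hits:
        pid = hit["parent_id"]
        if pid not in ordered_parent_ids:  # deduplicate, keep best rank first
            ordered_parent_ids.append(pid)
    return [parent_store[pid] for pid in ordered_parent_ids[:max_parents]]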


4. Metadata Schema Design

Metadata is not optional. It enables filtering, access control, versioning, and debugging.

Minimum Viable Metadata

Each chunk should include:

  • doc_id
  • chunk_id
  • title
  • section_path
  • source_uri
  • page_start / page_end
  • created_at / updated_at
  • language
  • hash (content checksum)
  • tenant/project
  • acl (access control)

Enhanced Metadata (Recommended)

  • doc_version
  • effective_date
  • tags
  • entities (product names, systems, people)
  • content_type (faq, guide, spec, code)
  • parent_id
  • quality_flags

These fields enable advanced filtering and evaluation later.
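One way to pin the schema down in code is a dataclass. The field names below come straight from the lists above; the types are illustrative:

from dataclasses import dataclass, field

@dataclass
class ChunkMetadata:
    doc_id: str
    chunk_id: str
    title: str
    section_path: str            # e.g. "2 > 2.1 > Cleaning"
    source_uri: str
    created_at: str              # ISO 8601 timestamps
    updated_at: str
    language: str
    hash: str                    # content checksum
    tenant: str
    page_start: int | None = None
    page_end: int | None = None
    acl: list[str] = field(default_factory=list)
    # Enhanced (recommended) fields
    doc_version: str | None = None
    effective_date: str | None = None
    content_type: str | None = None  # faq, guide, spec, code
    tags: list[str] = field(default_factory=list)
    entities: list[str] = field(default_factory=list)
    parent_id: str | None = None
    quality_flags: list[str] = field(default_factory=list)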

Stable Chunk ID Strategy

Chunk IDs must remain stable across re-processing.

Example:

chunk_id = sha1(doc_id + doc_version + section_path + chunk_index + text_hash_prefix)

Only changed content should produce new IDs.
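A concrete Python version of this formula might look like the following; the | separator and the 12-character text-hash prefix are arbitrary choices:

import hashlib

def make_chunk_id(doc_id: str, doc_version: str, section_path: str,
                  chunk_index: int, text: str) -> str:
    """Deterministic chunk ID: identical inputs always yield the same ID."""
    text_hash_prefix = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    key = f"{doc_id}|{doc_version}|{section_path}|{chunk_index}|{text_hash_prefix}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

Because the text hash is part of the key, editing a chunk's content produces a new ID while untouched chunks keep theirs, which is what makes incremental re-embedding possible.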


5. Batch Embedding Architecture

Embedding pipelines must be idempotent, incremental, and observable.

Suggested Data Model

documents

  • doc_id, version, uri, title, checksum

chunks

  • chunk_id, doc_id, text, metadata_json, hash

embeddings

  • chunk_id, model_name, dim, vector, text_hash

embedding_jobs

  • job_id, status, created_at

embedding_job_items

  • job_id, chunk_id, retry_count, error

Key Engineering Practices

  • Only embed chunks whose hash changed
  • Process in batches (32–256 chunks or token-limited)
  • Control concurrency to avoid rate limits
  • Retry failures with exponential backoff
  • Monitor throughput and failure rates
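A minimal sketch of such a worker. The embed_fn callable and stored_hashes mapping are placeholders for your own embedding client and database:

import time

def run_embedding_job(chunks, embed_fn, stored_hashes,
                      batch_size=64, max_retries=5):
    """Embed only changed chunks, in batches, with exponential backoff.

    chunks: dicts with "chunk_id", "text", "hash" keys.
    embed_fn(texts) -> list of vectors; stored_hashes: chunk_id -> last
    embedded text hash.
    """
    pending = [c for c in chunks if stored_hashes.get(c["chunk_id"]) != c["hash"]]
    vectors_by_id = {}
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                vectors = embed_fn([c["text"] for c in batch])
                for c, v in zip(batch, vectors):
                    vectors_by_id[c["chunk_id"]] = v
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up; surface the error to the job table
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    return vectors_by_id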

Supporting Multiple Models

Embedding records must include:

  • model_name
  • model_version
  • vector_dimension
  • normalized_flag

Allow multiple embeddings per chunk for gradual migration between models.


6. Retrieval Design: Hybrid Search and Reranking

Vector search alone is rarely sufficient.

Recommended Retrieval Pipeline

  1. Hybrid retrieval:
  • Vector similarity
  • BM25 keyword search
  2. Metadata filtering:
  • tenant/project
  • ACL
  • document type
  3. Reranking:
  • Lightweight reranker or LLM scoring
  4. Source citation:
  • Return source_uri + section_path + page

Hybrid search dramatically improves precision for exact terms and technical names.
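One simple way to combine the vector and BM25 result lists is reciprocal rank fusion (RRF), a widely used default rather than the only option:

def reciprocal_rank_fusion(result_lists, k=60, top_n=10):
    """Fuse ranked ID lists (e.g. vector + BM25) by reciprocal rank.

    Each input list holds chunk IDs ordered best-first; k dampens the
    influence of lower ranks (60 is the value from the original RRF paper).
    """
    scores = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

IDs that rank high in both lists float to the top, which is exactly what you want for exact technical terms that BM25 catches and vectors miss.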


7. Chunk Quality Monitoring

Many production issues are caused by poor chunks rather than model failures.

Common anti-patterns:

  • Chunks shorter than 50 tokens
  • Chunks longer than 1200 tokens
  • Repeated template content
  • Missing title context
  • Duplicate sections occupying top results

Add a simple rule engine that tags chunks with quality_flags.
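A minimal version of such a rule engine, using whitespace words as a rough token proxy and thresholds mirroring the list above:

def compute_quality_flags(chunk: dict, min_tokens: int = 50,
                          max_tokens: int = 1200) -> list[str]:
    """Tag a chunk with heuristic quality flags."""
    flags = []
    n_tokens = len(chunk["text"].split())  # rough proxy for token count
    if n_tokens < min_tokens:
        flags.append("too_short")
    if n_tokens > max_tokens:
        flags.append("too_long")
    if not chunk.get("title"):
        flags.append("missing_title")
    return flags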


8. End-to-End Processing Pipeline

A practical implementation roadmap:

  1. Ingest documents and generate doc_id
  2. Extract structured blocks
  3. Remove noise and duplicates
  4. Build parent chunks from sections
  5. Generate child chunks with overlap
  6. Attach metadata and hashes
  7. Upsert into chunks table
  8. Create embedding jobs for new/changed chunks
  9. Batch embedding with workers
  10. Build vector and keyword indexes
  11. Run evaluation queries (golden dataset)
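For step 11, a golden dataset can be as simple as (query, expected_chunk_id) pairs. Here is a minimal recall@k check, with retrieve_fn standing in for your own retrieval pipeline:

def recall_at_k(golden, retrieve_fn, k=10):
    """Share of golden queries whose expected chunk appears in the top k.

    golden: list of (query, expected_chunk_id) pairs;
    retrieve_fn(query, k) returns ranked chunk IDs.
    """
    if not golden:
        return 0.0
    hits = sum(1 for query, expected in golden if expected in retrieve_fn(query, k))
    return hits / len(golden)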

Final Thoughts

Designing an LLM knowledge base is less about models and more about information architecture.

The biggest improvements usually come from:

  • Better chunk structure
  • Strong metadata design
  • Incremental embedding pipelines
  • Hybrid retrieval strategies

If you treat your knowledge base like a search system rather than a document dump, both retrieval accuracy and generation quality improve significantly.
