Document loaders and chunking with LangChain

#langchain #rag #ai #node

This post covers local file ingestion and chunking in Node.js. For LangChain basics (LCEL, packages, agents), see the LangChain overview post. For embeddings, pgvector, and the full RAG flow, see the RAG with pgvector post - it uses one splitter inline; this post goes deeper on loaders and splitter choice.

Prerequisites

Node.js version 26
langchain, @langchain/core, @langchain/classic, and @langchain/textsplitters installed

npm i langchain @langchain/core @langchain/classic @langchain/textsplitters

More loader types (web, cloud, audio) live in standalone integration packages - see the document loader integrations page.

The Document type

Every loader returns Document instances from @langchain/core:

pageContent - the text of the chunk or file
metadata - optional key/value pairs (source path, section, page) used for citations

import { Document } from '@langchain/core/documents';

const doc = new Document({
  pageContent: 'pgvector adds vector search to PostgreSQL.',
  metadata: { source: 'notes/pgvector.txt', section: 'basics' }
});

Load a single file

Use TextLoader for plain text or markdown files:

import { TextLoader } from '@langchain/classic/document_loaders/fs/text';

const loader = new TextLoader('./notes/pgvector.txt');
const docs = await loader.load();

console.log(docs[0].pageContent);
console.log(docs[0].metadata.source);

The loader sets metadata.source to the file path - keep it for citations in RAG answers.
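
As a rough sketch of how that metadata might be turned into a citation label later on, the helper below works on any object shaped like a Document. The name formatCitation and the label format are hypothetical, not part of LangChain.

```javascript
// Hypothetical helper: builds a citation label from a document's metadata.
// Works on any object shaped like { pageContent, metadata }.
function formatCitation(doc) {
  const { source = 'unknown source', section } = doc.metadata ?? {};
  return section ? `${source} (${section})` : source;
}

const doc = {
  pageContent: 'pgvector adds vector search to PostgreSQL.',
  metadata: { source: 'notes/pgvector.txt', section: 'basics' }
};

console.log(formatCitation(doc)); // notes/pgvector.txt (basics)
```

Falling back to a placeholder when source is missing keeps manually created documents from producing empty citations.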

Load a directory

Use DirectoryLoader when you have many files. Map extensions to loader factories:

import { DirectoryLoader } from '@langchain/classic/document_loaders/fs/directory';
import { TextLoader } from '@langchain/classic/document_loaders/fs/text';

const loader = new DirectoryLoader('./notes', {
  '.txt': (path) => new TextLoader(path),
  '.md': (path) => new TextLoader(path)
});

const docs = await loader.load();
console.log(`Loaded ${docs.length} documents`);

PDF, CSV, and JSON loaders are available via other integration packages. This post uses .txt and .md files.
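
To make the extension mapping concrete, here is a simplified, dependency-free illustration of the idea: look up each file's extension in a map of factories and set aside anything without a match. It is not DirectoryLoader's implementation, and extensionOf and planLoaders are hypothetical names.

```javascript
// Simplified illustration of extension-to-factory mapping.
// Not DirectoryLoader's implementation; all names here are hypothetical.
function extensionOf(filePath) {
  const base = filePath.slice(filePath.lastIndexOf('/') + 1);
  const dot = base.lastIndexOf('.');
  // Dotfiles such as ".gitignore" are treated as having no extension.
  return dot > 0 ? base.slice(dot) : '';
}

function planLoaders(filePaths, factories) {
  const matched = [];
  const skipped = [];
  for (const filePath of filePaths) {
    const factory = factories[extensionOf(filePath)];
    if (factory) matched.push(factory(filePath));
    else skipped.push(filePath);
  }
  return { matched, skipped };
}

const plan = planLoaders(
  ['notes/pgvector.txt', 'notes/README.md', 'notes/diagram.png', 'notes/.gitignore'],
  {
    '.txt': (path) => ({ kind: 'text', path }),
    '.md': (path) => ({ kind: 'text', path })
  }
);

console.log(plan.matched.length); // 2
console.log(plan.skipped);        // [ 'notes/diagram.png', 'notes/.gitignore' ]
```

Listing the skipped files is a cheap way to notice formats that still need a loader.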

Split documents

Chunking makes retrieval more precise. Instead of embedding one large file, split it into smaller overlapping parts. Pass the docs array from TextLoader or DirectoryLoader to a splitter.

Two parameters matter most:

chunkSize - target maximum size per chunk (characters or tokens, depending on splitter)
chunkOverlap - shared text between adjacent chunks so context is not lost at boundaries

Start with chunkSize: 800 and chunkOverlap: 120, then tune based on document style and answer quality.
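
To see how the two parameters interact, the sketch below slices text into fixed windows by position. It is a deliberately simplified stand-in (real splitters prefer natural boundaries), and slidingWindows is a hypothetical name.

```javascript
// Simplified fixed-window chunker showing how chunkSize and chunkOverlap interact.
// Real splitters prefer natural boundaries; this sketch only slices by position.
function slidingWindows(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap; // how far each new chunk advances
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const chunks = slidingWindows('abcdefghijklmnopqrstuvwxyz', 10, 3);
console.log(chunks); // [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

With chunkSize 800 and chunkOverlap 120, each window in this model advances 680 characters, so about 15% of every chunk repeats its neighbor.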

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 120
});

const chunks = await splitter.splitDocuments(docs);
console.log(chunks.length);

Splitter comparison

The example above uses RecursiveCharacterTextSplitter, the default for most RAG setups. Alternatives:

Splitter	Best for
`RecursiveCharacterTextSplitter`	Default choice; tries paragraphs, then lines, then words, then single characters
`CharacterTextSplitter`	Splits on one separator (default `\n\n`) and merges pieces up to chunkSize; fits text with a single consistent delimiter
`TokenTextSplitter`	When chunk limits must match model token budgets
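
The recursive strategy can be sketched in a few lines: try the coarsest separator first and fall back to finer ones only for pieces that are still too large. This is a simplified illustration without overlap handling, not the library's implementation. The function name is hypothetical, and the separator list is my understanding of the usual defaults.

```javascript
// Simplified sketch of recursive splitting. No overlap handling, and not the
// library's implementation: it only shows the separator fallback order.
function recursiveSplit(text, chunkSize, separators = ['\n\n', '\n', ' ', '']) {
  if (text.length <= chunkSize) return text ? [text] : [];
  const [separator, ...rest] = separators;
  // Last resort: cut by position when no separator is left.
  if (separator === undefined || separator === '') {
    const out = [];
    for (let i = 0; i < text.length; i += chunkSize) out.push(text.slice(i, i + chunkSize));
    return out;
  }
  const chunks = [];
  let current = '';
  for (const piece of text.split(separator)) {
    const candidate = current ? current + separator + piece : piece;
    if (candidate.length <= chunkSize) {
      current = candidate; // piece still fits in the chunk being built
      continue;
    }
    if (current) chunks.push(current);
    current = '';
    if (piece.length <= chunkSize) current = piece;
    else chunks.push(...recursiveSplit(piece, chunkSize, rest)); // try a finer separator
  }
  if (current) chunks.push(current);
  return chunks;
}

const text = 'Short intro.\n\nA longer paragraph that will not fit in one chunk.';
console.log(recursiveSplit(text, 30));
// [ 'Short intro.', 'A longer paragraph that will', 'not fit in one chunk.' ]
```

The short paragraph survives intact, while the long one is broken at word boundaries rather than mid-word.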

Character-based:

import { CharacterTextSplitter } from '@langchain/textsplitters';

const splitter = new CharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 120
});

const chunks = await splitter.splitDocuments(docs);

Token-based:

import { TokenTextSplitter } from '@langchain/textsplitters';

const splitter = new TokenTextSplitter({
  encodingName: 'cl100k_base',
  chunkSize: 200,
  chunkOverlap: 20
});

const chunks = await splitter.splitDocuments(docs);

Use token-based splitting when each chunk must fit within a model's token limit, for example an embedding model's maximum input length. Character-based recursive splitting is the usual starting point for RAG over prose.
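
If you stay with character-based chunks, a quick budget check can flag chunks that are likely too large. The sketch below assumes roughly 4 characters per token, which is only a heuristic for English prose and not a real encoding. approxTokens and findOversized are hypothetical names, and you can pass in a real tokenizer's counting function instead.

```javascript
// Rough stand-in for a tokenizer: assumes about 4 characters per token.
// This is a heuristic for English prose, not an actual encoding.
const approxTokens = (text) => Math.ceil(text.length / 4);

// Returns the position and estimated size of chunks that exceed the budget.
function findOversized(chunkTexts, maxTokens, countTokens = approxTokens) {
  return chunkTexts
    .map((text, index) => ({ index, tokens: countTokens(text) }))
    .filter((entry) => entry.tokens > maxTokens);
}

const sample = ['a'.repeat(400), 'b'.repeat(1000), 'c'.repeat(799)];
console.log(findOversized(sample, 200));
// [ { index: 1, tokens: 250 } ]
```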

Metadata through the pipeline

Pass metadata when creating documents manually, or rely on loader metadata - splitters preserve it on each chunk:

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,
  chunkOverlap: 60
});

const chunks = await splitter.createDocuments(
  ['First paragraph.\n\nSecond paragraph.'],
  [{ source: 'manual', section: 'intro' }]
);

console.log(chunks[0].metadata);

After splitDocuments(docs), each chunk keeps fields like source from the parent document. Use those fields when storing chunks in a vector database or displaying citations.
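
One possible way to carry those fields into storage is to derive a stable id from source plus a per-file chunk index. The record shape, the id format, and the name toRecords are all assumptions for illustration; adapt them to your vector store's schema.

```javascript
// Hypothetical record shape for a vector store row; adjust to your schema.
function toRecords(chunks) {
  const perSource = new Map(); // running chunk index per source file
  return chunks.map((chunk) => {
    const source = chunk.metadata?.source ?? 'unknown';
    const index = perSource.get(source) ?? 0;
    perSource.set(source, index + 1);
    return {
      id: `${source}#${index}`,
      content: chunk.pageContent,
      metadata: { ...chunk.metadata, chunkIndex: index }
    };
  });
}

const records = toRecords([
  { pageContent: 'First paragraph.', metadata: { source: 'manual', section: 'intro' } },
  { pageContent: 'Second paragraph.', metadata: { source: 'manual', section: 'intro' } },
  { pageContent: 'Other file.', metadata: { source: 'notes/pgvector.txt' } }
]);

console.log(records.map((r) => r.id));
// [ 'manual#0', 'manual#1', 'notes/pgvector.txt#0' ]
```

Deterministic ids make it easier to re-ingest a changed file without leaving stale chunks behind.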

Choosing parameters

Short FAQs or API docs - smaller chunkSize (300–500) for precise retrieval
Long guides or blog posts - larger chunkSize (800–1200) to keep sections together
More overlap - helps when answers span chunk boundaries; increases storage and embedding cost
Less overlap - fewer redundant chunks; risk losing context at splits

Tune with real questions from your domain.
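
A minimal sketch of what that tuning loop might measure: the share of test questions for which the expected source file shows up among the retrieved chunks. hitRate is a hypothetical helper, and retrieve is a placeholder for your pipeline's search function. It is synchronous here for brevity, and the keyword matcher exists only to make the example run.

```javascript
// Hypothetical evaluation helper: share of questions whose expected source
// appears among the retrieved chunks.
function hitRate(cases, retrieve) {
  if (cases.length === 0) return 0;
  const hits = cases.filter(({ question, expectedSource }) =>
    retrieve(question).some((chunk) => chunk.metadata.source === expectedSource)
  );
  return hits.length / cases.length;
}

// Toy retriever for demonstration: keyword match over a tiny in-memory index.
const index = [
  { pageContent: 'pgvector adds vector search to PostgreSQL.', metadata: { source: 'notes/pgvector.txt' } },
  { pageContent: 'Chunk overlap keeps context at boundaries.', metadata: { source: 'notes/chunking.md' } }
];
const retrieve = (question) => {
  const words = question.toLowerCase().split(/\W+/).filter((word) => word.length > 3);
  return index.filter((chunk) =>
    words.some((word) => chunk.pageContent.toLowerCase().includes(word))
  );
};

const cases = [
  { question: 'What does pgvector add?', expectedSource: 'notes/pgvector.txt' },
  { question: 'Why use overlap?', expectedSource: 'notes/chunking.md' },
  { question: 'How do agents work?', expectedSource: 'notes/agents.md' }
];

console.log(hitRate(cases, retrieve).toFixed(2)); // 0.67
```

Running the same question set after each change to chunkSize or chunkOverlap gives a comparable number instead of an impression.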