Željko Šević

Originally published at sevic.dev

Document loaders and chunking with LangChain

This post covers local file ingestion and chunking in Node.js. For LangChain basics (LCEL, packages, agents), see the LangChain overview post. For embeddings, pgvector, and the full RAG flow, see the RAG with pgvector post. That post uses a single splitter inline; this one goes deeper on loaders and splitter choice.

Prerequisites

  • Node.js version 26
  • langchain, @langchain/core, @langchain/classic, and @langchain/textsplitters installed
npm i langchain @langchain/core @langchain/classic @langchain/textsplitters

More loader types (web, cloud, audio) live in standalone integration packages - see the document loader integrations page.

The Document type

Every loader returns Document instances from @langchain/core:

  • pageContent - the text of the chunk or file
  • metadata - optional key/value pairs (source path, section, page) used for citations
import { Document } from '@langchain/core/documents';

const doc = new Document({
  pageContent: 'pgvector adds vector search to PostgreSQL.',
  metadata: { source: 'notes/pgvector.txt', section: 'basics' }
});

Load a single file

Use TextLoader for plain text or markdown files:

import { TextLoader } from '@langchain/classic/document_loaders/fs/text';

const loader = new TextLoader('./notes/pgvector.txt');
const docs = await loader.load();

console.log(docs[0].pageContent);
console.log(docs[0].metadata.source);

The loader sets metadata.source to the file path - keep it for citations in RAG answers.
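
As a minimal sketch of what keeping metadata.source enables, the helper below builds a numbered source list from retrieved chunks. The name formatCitations is hypothetical and not part of LangChain; plain objects with the same shape as Document stand in for loader output.

```javascript
// Hypothetical helper (not a LangChain API): builds a numbered, de-duplicated
// source list from objects shaped like Documents ({ pageContent, metadata }).
function formatCitations(docs) {
  const sources = [];
  for (const doc of docs) {
    const source = doc.metadata?.source ?? 'unknown';
    if (!sources.includes(source)) sources.push(source);
  }
  return sources.map((source, i) => `[${i + 1}] ${source}`).join('\n');
}

// Plain objects stand in for retrieved chunks here.
const retrieved = [
  { pageContent: 'pgvector adds vector search.', metadata: { source: 'notes/pgvector.txt' } },
  { pageContent: 'Create the extension first.', metadata: { source: 'notes/pgvector.txt' } },
  { pageContent: 'Chunks overlap by design.', metadata: { source: 'notes/chunking.md' } }
];

console.log(formatCitations(retrieved));
// [1] notes/pgvector.txt
// [2] notes/chunking.md
```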

Load a directory

Use DirectoryLoader when you have many files. Map extensions to loader factories:

import { DirectoryLoader } from '@langchain/classic/document_loaders/fs/directory';
import { TextLoader } from '@langchain/classic/document_loaders/fs/text';

// Keys are file extensions; values are factories that build a loader for each matching file.
const loader = new DirectoryLoader('./notes', {
  '.txt': (path) => new TextLoader(path),
  '.md': (path) => new TextLoader(path)
});

const docs = await loader.load();
console.log(`Loaded ${docs.length} documents`);

PDF, CSV, and JSON loaders are available via other integration packages. This post uses .txt and .md files.
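
A quick sanity check after a directory load is to count documents per extension, so you can confirm that both mappings picked up files. This is a sketch only: countByExtension is a hypothetical helper, and it assumes each document has the file path in metadata.source.

```javascript
// Hypothetical helper (not a LangChain API): counts documents per file
// extension, reading the path from metadata.source.
function countByExtension(docs) {
  const counts = {};
  for (const doc of docs) {
    const source = String(doc.metadata?.source ?? '');
    const fileName = source.split(/[\\/]/).pop();
    const dot = fileName.lastIndexOf('.');
    // A leading dot (e.g. ".env") is treated as "no extension".
    const ext = dot > 0 ? fileName.slice(dot).toLowerCase() : '(none)';
    counts[ext] = (counts[ext] ?? 0) + 1;
  }
  return counts;
}

// Plain objects stand in for the docs array returned by loader.load().
const loaded = [
  { pageContent: '...', metadata: { source: 'notes/pgvector.txt' } },
  { pageContent: '...', metadata: { source: 'notes/chunking.md' } },
  { pageContent: '...', metadata: { source: 'notes/rag.md' } }
];

console.log(countByExtension(loaded)); // { '.txt': 1, '.md': 2 }
```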

Split documents

Chunking makes retrieval more precise. Instead of embedding one large file, split it into smaller overlapping parts. Pass the docs array from TextLoader or DirectoryLoader to a splitter.

Two parameters matter most:

  • chunkSize - target maximum size per chunk (characters or tokens, depending on splitter)
  • chunkOverlap - shared text between adjacent chunks so context is not lost at boundaries

Start with chunkSize: 800 and chunkOverlap: 120, then tune based on document style and answer quality.

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 120
});

const chunks = await splitter.splitDocuments(docs);
console.log(chunks.length);
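
To see what chunkSize and chunkOverlap do in isolation, here is a naive fixed-window chunker. It is an illustration only, not how LangChain's splitters work internally: they also look for separators, while this sketch cuts at fixed character offsets.

```javascript
// Naive fixed-window chunker, for illustration only.
// Each window starts (chunkSize - chunkOverlap) characters after the previous one.
function slidingWindowChunks(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

console.log(slidingWindowChunks('abcdefghijklmnopqrstuvwxyz', 10, 3));
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Each chunk starts with the last three characters of the previous one, which is the shared context that chunkOverlap buys.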

Splitter comparison

The example above uses RecursiveCharacterTextSplitter, the default for most RAG setups. Alternatives:

| Splitter | Best for |
| --- | --- |
| RecursiveCharacterTextSplitter | Default choice; tries paragraph breaks, then line breaks, then spaces |
| CharacterTextSplitter | Splitting on a single separator (default "\n\n") when the text has one reliable delimiter |
| TokenTextSplitter | When chunk limits must match model token budgets |
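
The fallback idea behind RecursiveCharacterTextSplitter can be sketched in a few lines. This is a simplified illustration, not the library's implementation: it assumes the separator order "\n\n", "\n", " ", and unlike the real splitter it does not merge small pieces back together or add overlap.

```javascript
// Simplified sketch of recursive splitting: split on the coarsest separator,
// and fall back to a finer one only for pieces that are still too long.
function recursiveSplit(text, chunkSize, separators = ['\n\n', '\n', ' ']) {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  if (separators.length === 0) {
    // No separators left: hard-cut into fixed-size pieces.
    const pieces = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize));
    }
    return pieces;
  }
  const [separator, ...finer] = separators;
  return text
    .split(separator)
    .flatMap((piece) => recursiveSplit(piece, chunkSize, finer));
}

const sampleText = 'Short paragraph.\n\nA much longer paragraph\nwith two lines.';
console.log(recursiveSplit(sampleText, 25));
// [ 'Short paragraph.', 'A much longer paragraph', 'with two lines.' ]
```

The first paragraph fits and stays whole; only the second, longer paragraph is split again at the line break.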

Character-based:

import { CharacterTextSplitter } from '@langchain/textsplitters';

const splitter = new CharacterTextSplitter({
  separator: '\n\n', // the only separator used for splitting; this is the default
  chunkSize: 800,
  chunkOverlap: 120
});

const chunks = await splitter.splitDocuments(docs);

Token-based:

import { TokenTextSplitter } from '@langchain/textsplitters';

const splitter = new TokenTextSplitter({
  encodingName: 'cl100k_base',
  chunkSize: 200,
  chunkOverlap: 20
});

const chunks = await splitter.splitDocuments(docs);

Use token-based splitting when chunks must fit within a model's token limit, such as an embedding model's maximum input size. Character-based recursive splitting is the usual starting point for RAG over prose.
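
If you stay with character-based chunks, a rough estimate can flag chunks that are likely to exceed a token budget. The sketch below is not a tokenizer: it assumes about 4 characters per token, a common rule of thumb for English text, and real counts vary by language and content. Both helper names are hypothetical.

```javascript
// Rough heuristic, NOT a tokenizer: assumes about 4 characters per token.
function estimateTokens(text, charsPerToken = 4) {
  return Math.ceil(text.length / charsPerToken);
}

// Hypothetical helper: returns chunks whose estimated size exceeds the budget.
function findOversizedChunks(chunks, maxTokens) {
  return chunks.filter((chunk) => estimateTokens(chunk.pageContent) > maxTokens);
}

// Plain objects stand in for splitter output.
const sampleChunks = [
  { pageContent: 'a'.repeat(800), metadata: { source: 'notes/a.txt' } },
  { pageContent: 'b'.repeat(1200), metadata: { source: 'notes/b.txt' } }
];

console.log(estimateTokens(sampleChunks[0].pageContent)); // 200
console.log(findOversizedChunks(sampleChunks, 250).length); // 1
```

When the estimate is close to the limit, switch to TokenTextSplitter rather than trusting the heuristic.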

Metadata through the pipeline

Pass metadata when creating documents manually, or rely on loader metadata - splitters preserve it on each chunk:

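import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,
  chunkOverlap: 60
});

// The second argument holds one metadata object per input text.
const chunks = await splitter.createDocuments(
  ['First paragraph.\n\nSecond paragraph.'],
  [{ source: 'manual', section: 'intro' }]
);

// Each chunk carries the source and section passed above.
console.log(chunks[0].metadata);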

After splitDocuments(docs), each chunk keeps fields like source from the parent document. Use those fields when storing chunks in a vector database or displaying citations.
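
One way to use those fields is to give every chunk a stable identifier before storing it. The sketch below adds a per-source chunkIndex and an id built from source and index. withChunkIds and the id format are hypothetical choices for illustration, and the function returns plain objects rather than Document instances.

```javascript
// Hypothetical helper (not a LangChain API): adds a per-source chunkIndex
// and a stable id to each chunk's metadata without mutating the input.
function withChunkIds(chunks) {
  const nextIndex = new Map();
  return chunks.map((chunk) => {
    const source = chunk.metadata?.source ?? 'unknown';
    const chunkIndex = nextIndex.get(source) ?? 0;
    nextIndex.set(source, chunkIndex + 1);
    return {
      pageContent: chunk.pageContent,
      metadata: { ...chunk.metadata, chunkIndex, id: `${source}#${chunkIndex}` }
    };
  });
}

// Plain objects stand in for the output of splitDocuments(docs).
const splitOutput = [
  { pageContent: 'First part.', metadata: { source: 'notes/pgvector.txt' } },
  { pageContent: 'Second part.', metadata: { source: 'notes/pgvector.txt' } },
  { pageContent: 'Other file.', metadata: { source: 'notes/chunking.md' } }
];

console.log(withChunkIds(splitOutput).map((chunk) => chunk.metadata.id));
// [ 'notes/pgvector.txt#0', 'notes/pgvector.txt#1', 'notes/chunking.md#0' ]
```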

Choosing parameters

  • Short FAQs or API docs - smaller chunkSize (300–500) for precise retrieval
  • Long guides or blog posts - larger chunkSize (800–1200) to keep sections together
  • More overlap - helps when answers span chunk boundaries; increases storage and embedding cost
  • Less overlap - fewer redundant chunks; risk losing context at splits

Tune with real questions from your domain.
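
To put numbers on the overlap trade-off, the sketch below estimates how many chunks a document of a given length produces. It assumes fixed-size windows, so separator-aware splitters will give somewhat different counts; treat it as a rough guide to storage and embedding cost.

```javascript
// Estimates chunk count for a fixed-window split.
// Separator-aware splitters will produce somewhat different numbers.
function estimateChunkCount(textLength, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  if (textLength <= chunkSize) return textLength > 0 ? 1 : 0;
  const step = chunkSize - chunkOverlap;
  return Math.ceil((textLength - chunkOverlap) / step);
}

// A 20,000-character document at chunkSize 800 with increasing overlap.
console.log(estimateChunkCount(20000, 800, 0));   // 25
console.log(estimateChunkCount(20000, 800, 120)); // 30
console.log(estimateChunkCount(20000, 800, 240)); // 36
```

In this example, going from no overlap to 240 characters of overlap raises the chunk count from 25 to 36, so every extra bit of shared context is paid for in embeddings and storage.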

Demo

Runnable loader and splitter scripts for this post live in the langchain-loaders-chunking-demo folder. Get access via code demos.
