Have you ever downloaded an ebook collection only to find it's a single massive EPUB file containing multiple books? This is especially common with translated works where publishers bundle entire series or author collections into one file. In this article, I'll explain how to split these collections into individual EPUB files, complete with code examples and technical details.
## Why Split EPUB Collections?
Here are some common scenarios where splitting EPUBs becomes essential:
1. **Collection Bundles**
   - Publisher bundles like "The Complete Works of [Author]" containing 10+ books
   - Box sets sold as single files (e.g., "The Complete Sherlock Holmes Collection [10 Books Box Set].epub")
   - Series compilations where each book should be separate
2. **Reading Experience**
   - E-readers with limited storage prefer smaller files
   - Better organization in your digital library
   - Easier to sync specific books across devices
   - Faster loading times on older devices
3. **Content Management**
   - Selectively share individual books
   - Remove unwanted titles from a collection
   - Reorganize chapters or books in a different order
4. **Academic Use**
   - Extract specific volumes for citation
   - Distribute individual readings to students
   - Create custom anthologies
## Understanding EPUB Structure
Before diving into the code, let's understand how EPUB files work:
```
EPUB File (ZIP archive)
├── mimetype
├── META-INF/
│   └── container.xml
└── OEBPS/
    ├── content.opf
    ├── toc.ncx
    ├── chapter1.html
    ├── chapter2.html
    └── images/
```
Key files:

- `mimetype`: Must be the first file in the archive, uncompressed, containing `application/epub+zip`
- `container.xml`: Points to the OPF file location
- `content.opf`: Contains metadata and a manifest of all resources
- `toc.ncx`: The table of contents with navigation points (navPoints)
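The first job of any splitter is locating the OPF file, which `container.xml` points to. A minimal regex-based sketch (the helper name is ours; a production tool should use a real XML parser):

```typescript
// Sketch: extract the OPF path from META-INF/container.xml.
// Assumes a single <rootfile> with double-quoted attributes.
function findOpfPath(containerXml: string): string | null {
  const m = containerXml.match(/<rootfile\s[^>]*full-path="([^"]+)"/i);
  return m ? m[1] : null;
}
```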
## The Splitting Process

At a high level, the splitter parses the collection's table of contents, decides where each book begins and ends, extracts that slice of content along with its resources, and repackages it as a standalone EPUB.

### Key Technical Details
### 1. Parsing the Table of Contents
The NCX (Navigation Control file for XML) is where the magic happens. It contains navPoints that define the book structure:
```typescript
import * as cheerio from 'cheerio';

// Parse the NCX to find book boundaries.
// TocEntry and decodeHtmlEntities are defined elsewhere in the implementation.
function parseNCX(ncxXml: string): TocEntry[] {
  const tocData: TocEntry[] = [];
  const $ = cheerio.load(ncxXml, { xmlMode: true });

  $('navPoint').each((_i, elem) => {
    // Pull the navLabel text with a regex to sidestep nested-tag quirks
    const textMatch = $(elem).html()?.match(/<text>([^<]+)<\/text>/i);
    let label = textMatch ? textMatch[1] : '';
    label = decodeHtmlEntities(label); // Handle &amp;, numeric entities, etc.

    let href = $('content', elem).first().attr('src') || '';
    href = decodeURIComponent(href);

    if (label && href) {
      tocData.push({ label, href });
    }
  });

  return tocData;
}
```
**Key insight:** The `href` contains both the HTML file path and an optional anchor (e.g., `chapter1.html#section2`).
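In practice it helps to split that href into its two parts up front; a tiny helper (the name is ours) does the job:

```typescript
// Split "chapter1.html#section2" into a file path and an optional anchor
function splitHref(href: string): { file: string; anchor: string | null } {
  const hashIndex = href.indexOf('#');
  if (hashIndex === -1) return { file: href, anchor: null };
  return { file: href.slice(0, hashIndex), anchor: href.slice(hashIndex + 1) };
}
```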
### 2. Filtering Non-Book Chapters
Collections often include metadata chapters like copyright, introduction, or table of contents. We filter these out:
```typescript
// Filter out front matter and other non-book chapters
const books = tocData.filter(entry => {
  const lowerTitle = entry.label.toLowerCase();
  const skipKeywords = [
    'copyright', 'contents', 'introduction', 'preface',
    'dedication', 'acknowledgments', 'foreword', 'prologue'
  ];
  return !skipKeywords.some(kw => lowerTitle.includes(kw));
});
```
### 3. Extracting Content by Anchor
This is the most critical part. When multiple books share the same HTML file, we use anchors to extract specific sections:
```typescript
function extractContentByAnchor(
  html: string,
  startAnchor: string | null,
  endAnchor: string | null
): string {
  // Work on the body content only
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  if (!bodyMatch) return html;
  const bodyContent = bodyMatch[1];

  // Find the start anchor position
  let startPos = 0;
  if (startAnchor) {
    const pattern = new RegExp(`<a[^>]*id=["']${startAnchor}["'][^>]*>`, 'i');
    const match = bodyContent.match(pattern);
    if (match?.index !== undefined) {
      startPos = match.index + match[0].length;
    }
  }

  // Find the end anchor position
  let endPos = bodyContent.length;
  if (endAnchor) {
    const pattern = new RegExp(`<a[^>]*id=["']${endAnchor}["'][^>]*>`, 'i');
    const match = bodyContent.match(pattern);
    if (match?.index !== undefined) {
      endPos = match.index;
    }
  }

  // Extract the content between the anchors
  let content = bodyContent.substring(startPos, endPos);

  // Trim a tag left incomplete at the end boundary
  const lastCompleteTag = content.lastIndexOf('>');
  const lastOpeningTag = content.lastIndexOf('<');
  if (lastOpeningTag > lastCompleteTag) {
    content = content.substring(0, lastCompleteTag + 1);
  }

  return content.trim();
}
```
### 4. Handling Multi-File Content
Sometimes a book spans multiple HTML files in the spine. We need to concatenate them:
```typescript
function extractContentAcrossFiles(
  fs: MemoryFileSystem,
  spine: string[],
  opfDir: string,
  startHtmlPath: string,
  startAnchor: string | null,
  endHtmlPath: string,
  endAnchor: string | null
): string {
  let content = '';

  // Locate the start and end files in the spine
  const startFile = startHtmlPath.split('/').pop();
  const endFile = endHtmlPath.split('/').pop();
  const startIndex = spine.findIndex(h => h === startFile || h.endsWith('/' + startFile));
  const endIndex = spine.findIndex(h => h === endFile || h.endsWith('/' + endFile));
  if (startIndex === -1) return ''; // start file not in spine: nothing to extract

  // Extract from each file in the range; if the end file is missing, run to the spine's end
  const lastIndex = endIndex === -1 ? spine.length - 1 : endIndex;
  for (let i = startIndex; i <= lastIndex; i++) {
    const htmlPath = opfDir ? `${opfDir}/${spine[i]}` : spine[i];
    const html = fs.readFileText(htmlPath);

    if (i === startIndex && i === endIndex) {
      // Same file: slice between both anchors
      content += extractContentByAnchor(html, startAnchor, endAnchor);
    } else if (i === startIndex) {
      // Start file: from the anchor to the end
      content += extractContentByAnchor(html, startAnchor, null);
    } else if (i === endIndex) {
      // End file: from the start to the anchor
      content += '\n' + extractContentByAnchor(html, null, endAnchor);
    } else {
      // Middle file: take the full body
      const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
      if (bodyMatch) content += '\n' + bodyMatch[1].trim();
    }
  }

  return content;
}
```
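The `spine` array itself comes from `content.opf`: the `<spine>` element lists `itemref`s in reading order, and each `idref` resolves through the `<manifest>` to an href. A regex-based sketch of that lookup (our helper, assuming well-formed EPUB 2 OPF with double-quoted attributes):

```typescript
// Sketch: map spine itemrefs to manifest hrefs, in reading order
function parseSpine(opfXml: string): string[] {
  // Build an id -> href map from the manifest items
  const hrefById = new Map<string, string>();
  const itemRe = /<item\s[^>]*>/gi;
  let m: RegExpExecArray | null = null;
  while ((m = itemRe.exec(opfXml)) !== null) {
    const id = m[0].match(/\bid="([^"]+)"/)?.[1];
    const href = m[0].match(/\bhref="([^"]+)"/)?.[1];
    if (id && href) hrefById.set(id, href);
  }

  // Resolve itemrefs in document order
  const spine: string[] = [];
  const refRe = /<itemref\s[^>]*idref="([^"]+)"/gi;
  while ((m = refRe.exec(opfXml)) !== null) {
    const href = hrefById.get(m[1]);
    if (href) spine.push(href);
  }
  return spine;
}
```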
### 5. Resource Collection
We must collect all images and CSS files referenced by the extracted content:
```typescript
// Collect images referenced by the extracted content
// (htmlDir is the directory of the source HTML file inside the archive)
const $ = cheerio.load(bookHtml);
const imagePaths: string[] = [];
const imagesToInclude = new Set<string>();

$('img').each((_j, img) => {
  const src = $(img).attr('src');
  if (src) {
    // Resolve the path relative to the HTML file's directory
    let imgPath: string;
    if (src.startsWith('../')) {
      const parts = htmlDir.split('/');
      parts.pop();
      imgPath = [...parts, src.replace('../', '')].join('/');
    } else {
      imgPath = `${htmlDir}/${src}`;
    }
    if (fs.fileExists(imgPath)) {
      imagesToInclude.add(imgPath);
      imagePaths.push(imgPath);
    }
  }
});

// Similar logic for CSS files
$('link[rel="stylesheet"]').each((_j, link) => {
  // ... resolve path and collect
});
```
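Note that the resolution above only handles a single `../` level. A more general helper (a hypothetical refactor on our part, not code from the original implementation) walks the path segments instead:

```typescript
// Hypothetical helper: resolve an href relative to the directory of the
// HTML file that references it, handling any number of "../" segments.
function resolveRelativePath(htmlDir: string, src: string): string {
  const parts = htmlDir ? htmlDir.split('/') : [];
  for (const seg of src.split('/')) {
    if (seg === '..') parts.pop();              // go up one directory
    else if (seg !== '.' && seg !== '') parts.push(seg); // skip "./" and empties
  }
  return parts.join('/');
}
```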
### 6. Generating a Valid EPUB Structure
Each split book needs proper EPUB structure:
```typescript
async function createEpub(
  title: string,
  author: string,
  htmlContent: string,
  images: string[],
  cssFiles: string[]
): Promise<Uint8Array> {
  const bookZip = new JSZip();

  // 1. Mimetype: MUST be the first entry, stored uncompressed
  bookZip.file('mimetype', 'application/epub+zip', {
    compression: 'STORE'
  });

  // 2. Container XML pointing at the OPF
  const containerXml = `<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`;
  bookZip.file('META-INF/container.xml', containerXml);

  // 3. Content files
  bookZip.file('OEBPS/content.html', htmlContent);

  // 4. Resources, copied flat into OEBPS/ under their base names
  images.forEach(imgPath => {
    const data = fs.readFile(imgPath); // fs closes over the in-memory filesystem
    const relativePath = imgPath.split('/').pop();
    bookZip.file(`OEBPS/${relativePath}`, data);
  });

  // 5. Generate the OPF and NCX ('en' is a placeholder; in practice,
  // carry the language over from the source collection's metadata)
  const opfContent = generateOPF(title, author, 'en', 'content.html', images, cssFiles);
  bookZip.file('OEBPS/content.opf', opfContent);

  const ncxContent = generateNCX(title, [{ label: title, href: 'content.html' }]);
  bookZip.file('OEBPS/toc.ncx', ncxContent);

  // 6. Produce the final EPUB bytes
  return await bookZip.generateAsync({
    type: 'uint8array',
    compression: 'DEFLATE',
    compressionOptions: { level: 6 },
    platform: 'UNIX'
  });
}
```
### 7. Generating OPF and NCX Files
The OPF (Open Package Format) file defines the book's metadata and manifest:
```typescript
function generateOPF(
  title: string,
  author: string,
  language: string,
  htmlFile: string,
  imagePaths: string[],
  cssPaths: string[]
): string {
  let manifestItems = '';
  let idCounter = 0;

  // NCX (table of contents)
  const ncxId = `ncx${idCounter++}`;
  manifestItems += `    <item id="${ncxId}" href="toc.ncx" media-type="application/x-dtbncx+xml"/>\n`;

  // Main HTML
  const htmlId = `html${idCounter++}`;
  manifestItems += `    <item id="${htmlId}" href="${htmlFile}" media-type="application/xhtml+xml"/>\n`;

  // Images, referenced by base name since they are copied flat into OEBPS/
  imagePaths.forEach((imgPath, idx) => {
    const fileName = imgPath.split('/').pop() || imgPath;
    const ext = fileName.split('.').pop()?.toLowerCase();
    let mediaType = 'image/jpeg';
    if (ext === 'png') mediaType = 'image/png';
    else if (ext === 'gif') mediaType = 'image/gif';
    manifestItems += `    <item id="img${idx}" href="${fileName}" media-type="${mediaType}"/>\n`;
  });

  // CSS, also referenced by base name
  cssPaths.forEach((cssPath, idx) => {
    const fileName = cssPath.split('/').pop() || cssPath;
    manifestItems += `    <item id="css${idx}" href="${fileName}" media-type="text/css"/>\n`;
  });

  return `<?xml version="1.0" encoding="UTF-8"?>
<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>${escapeXml(title)}</dc:title>
    <dc:creator>${escapeXml(author)}</dc:creator>
    <dc:language>${language}</dc:language>
    <dc:identifier id="BookId">urn:uuid:${generateUUID()}</dc:identifier>
  </metadata>
  <manifest>
${manifestItems}  </manifest>
  <spine toc="${ncxId}">
    <itemref idref="${htmlId}"/>
  </spine>
</package>`;
}
```
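`escapeXml` and `generateNCX` are used above but never shown. Here is one plausible minimal version of each (our sketch; the NCX layout follows the EPUB 2 / NCX 2005-1 shape used throughout this article):

```typescript
// Minimal XML escaping for text nodes and attribute values
function escapeXml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Sketch of generateNCX: one navPoint per TOC entry
function generateNCX(title: string, entries: { label: string; href: string }[]): string {
  const navPoints = entries.map((e, i) => `    <navPoint id="nav${i + 1}" playOrder="${i + 1}">
      <navLabel><text>${escapeXml(e.label)}</text></navLabel>
      <content src="${e.href}"/>
    </navPoint>`).join('\n');

  return `<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta name="dtb:depth" content="1"/>
  </head>
  <docTitle><text>${escapeXml(title)}</text></docTitle>
  <navMap>
${navPoints}
  </navMap>
</ncx>`;
}
```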
## Browser vs Node.js Implementation

The code can run in both environments:

**Node.js (using adm-zip):**

- Direct filesystem access
- Better suited to batch processing
- Can use cheerio for HTML parsing

**Browser (using JSZip + Web Workers):**

- Privacy-first: files never leave the device
- Uses Comlink for worker communication
- Memory-efficient with streaming
### Browser Architecture

*(Architecture diagram not reproduced here.)*
## Implementation Tips
1. **Anchor Detection**: Not all EPUBs use consistent anchor formats. Support both `id="anchor"` and `name="anchor"` attributes.
2. **HTML Entities**: Always decode HTML entities in titles (`&amp;`, numeric character references, etc.).
3. **Path Resolution**: Handle relative paths correctly (`../`, `./`, absolute).
4. **File Structure**: Maintain the original EPUB's directory structure for resources.
5. **Mimetype**: Always add `mimetype` as the first file in the ZIP with `compression: 'STORE'`.
6. **Error Handling**: Some collections have malformed NCX files. Provide fallback extraction strategies.
7. **Memory Management**: For large collections, process books sequentially rather than loading everything into memory.
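For the malformed-NCX case, one fallback strategy (an assumption on our part, not the article's implementation) is to treat top-level `<h1>` headings in the spine HTML as book boundaries:

```typescript
// Fallback sketch: derive book boundaries from <h1> headings when the NCX
// is missing or unusable. Offsets can feed the same anchor-slicing logic.
interface FallbackEntry { label: string; offset: number; }

function findHeadingBoundaries(html: string): FallbackEntry[] {
  const entries: FallbackEntry[] = [];
  const re = /<h1[^>]*>([\s\S]*?)<\/h1>/gi;
  let m: RegExpExecArray | null = null;
  while ((m = re.exec(html)) !== null) {
    const label = m[1].replace(/<[^>]+>/g, '').trim(); // strip nested tags
    if (label) entries.push({ label, offset: m.index });
  }
  return entries;
}
```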
## Conclusion
Splitting EPUB collections requires understanding the EPUB specification, particularly the NCX navigation structure and how content is organized across HTML files. The key challenges are:
- Parsing the table of contents correctly
- Extracting content by anchors when multiple books share HTML files
- Collecting all referenced resources (images, CSS)
- Generating valid EPUB metadata files
With the techniques described in this article, you can build robust EPUB splitting tools that work both in browsers for privacy and on servers for batch processing.
The complete implementation is available as an open-source web tool that runs entirely in your browser - no files are ever uploaded to any server, ensuring complete privacy for your ebook collection.
## Try It Now
Ready to split your EPUB collections? Try our free online EPUB splitter - it works entirely in your browser with complete privacy protection:
Free EPUB Splitter - Split Collections into Individual Books
Features:
- Split EPUB collections into individual books automatically
- 100% browser-based - no files uploaded to any server
- Preserves formatting, images, and metadata
- Download all split books as a ZIP file
- Completely free to use
Try it yourself: The code examples in this article are based on a real implementation. You can adapt them for your own EPUB processing needs, whether building a command-line tool, a web application, or integrating into a larger ebook management system.

