monkeymore studio

Posted on Apr 2

How to Split an EPUB Collection into Individual Books

#programming #productivity #tooling #tutorial

Have you ever downloaded an ebook collection only to find it's a single massive EPUB file containing multiple books? This is especially common with translated works where publishers bundle entire series or author collections into one file. In this article, I'll explain how to split these collections into individual EPUB files, complete with code examples and technical details.

Why Split EPUB Collections?

Here are some common scenarios where splitting EPUBs becomes essential:

1. Collection Bundles

Publisher bundles like "The Complete Works of [Author]" containing 10+ books
Box sets sold as single files (e.g., "The Complete Sherlock Holmes Collection [10 Books Box Set].epub")
Series compilations where each book should be separate

2. Reading Experience

E-readers with limited storage prefer smaller files
Better organization in your digital library
Easier to sync specific books across devices
Faster loading times on older devices

3. Content Management

Selectively share individual books
Remove unwanted titles from a collection
Reorganize chapters or books in a different order

4. Academic Use

Extract specific volumes for citation
Distribute individual readings to students
Create custom anthologies

Understanding EPUB Structure

Before diving into the code, let's understand how EPUB files work:

EPUB File (ZIP archive)
├── mimetype
├── META-INF/
│   └── container.xml
└── OEBPS/
    ├── content.opf
    ├── toc.ncx
    ├── chapter1.html
    ├── chapter2.html
    └── images/

Key files:

mimetype: Must be the first file, uncompressed, containing application/epub+zip
container.xml: Points to the OPF file location
content.opf: Contains metadata and manifest of all resources
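toc.ncx: The table of contents with navigation points (navPoints). This is the EPUB 2 format; EPUB 3 files use an XHTML navigation document instead, often shipping an NCX alongside it for backward compatibility.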
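
To make the role of container.xml concrete, here is a minimal sketch of locating the OPF file. The helper names `findOpfPath` and `opfDirOf` are hypothetical (they do not come from the original implementation), and the regex approach is only an assumption chosen to keep the example dependency-free; a real XML parser would be more robust.

```typescript
// Hypothetical helper: read the OPF location from META-INF/container.xml.
// Returns null when no <rootfile> element with a full-path attribute is found.
function findOpfPath(containerXml: string): string | null {
  // Collect every <rootfile ...> tag (the \b keeps <rootfiles> from matching).
  const rootfiles = containerXml.match(/<rootfile\b[^>]*>/gi) ?? [];
  for (const tag of rootfiles) {
    const pathMatch = tag.match(/\bfull-path\s*=\s*(["'])(.*?)\1/i);
    if (pathMatch && pathMatch[2]) return pathMatch[2];
  }
  return null;
}

// Hypothetical helper: directory that contains the OPF file.
// Manifest hrefs inside the OPF are resolved relative to this directory.
function opfDirOf(opfPath: string): string {
  const slash = opfPath.lastIndexOf('/');
  return slash === -1 ? '' : opfPath.substring(0, slash);
}
```

For the layout shown above, `findOpfPath` would yield `OEBPS/content.opf` and `opfDirOf` would yield `OEBPS`.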

The Splitting Process

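Here's how the splitting algorithm works at a high level:

Unzip the EPUB and read container.xml to locate the OPF file
Parse the NCX table of contents to find where each book starts
Filter out entries that are not books (copyright, contents, and so on)
Extract each book's content, by anchor within one file or across several spine files
Collect the images and CSS files that the extracted content references
Write a new EPUB (mimetype, container.xml, OPF, NCX, content) for each book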

Key Technical Details

1. Parsing the Table of Contents

The NCX (Navigation Control file for XML) is where the magic happens. It contains navPoints that define the book structure:

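import * as cheerio from 'cheerio';

interface TocEntry {
  label: string; // Display title from <navLabel><text>
  href: string;  // Target from <content src="...">, may include a #anchor
}

// Parse NCX to find book boundaries
function parseNCX(ncxXml: string): TocEntry[] {
  const tocData: TocEntry[] = [];
  const $ = cheerio.load(ncxXml, { xmlMode: true });

  $('navPoint').each((_i, elem) => {
    // The first <text> inside a navPoint is its own label
    const textMatch = $(elem).html()?.match(/<text>([^<]+)<\/text>/i);
    let label = textMatch ? textMatch[1] : '';
    label = decodeHtmlEntities(label).trim(); // Handle &amp;, &#1234;, etc.

    let href = $('content', elem).first().attr('src') || '';
    try {
      href = decodeURIComponent(href);
    } catch {
      // Malformed percent-escape: keep the raw href
    }

    if (label && href) {
      tocData.push({ label, href });
    }
  });

  return tocData;
}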
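
The parser above calls `decodeHtmlEntities` without showing it. The following is only a sketch of what such a helper might look like, assuming that numeric references plus a handful of named entities are enough for typical book titles; the `NAMED_ENTITIES` table is illustrative and deliberately incomplete.

```typescript
// Illustrative subset of named entities; extend as needed.
const NAMED_ENTITIES: Record<string, string> = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00A0',
};

// Sketch of an entity decoder: handles &name;, &#1234; and &#x4D2; forms.
// Everything is decoded in a single pass, so "&amp;lt;" becomes "&lt;", not "<".
function decodeHtmlEntities(text: string): string {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (whole: string, body: string) => {
    if (body.startsWith('#')) {
      const isHex = body.charAt(1).toLowerCase() === 'x';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Unknown named entities are left as they are.
    return NAMED_ENTITIES[body.toLowerCase()] ?? whole;
  });
}
```

A production tool would more likely rely on a maintained entity-decoding library, since the full HTML entity list runs to thousands of names.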

Key insight: The href contains both the HTML file path and an optional anchor (e.g., chapter1.html#section2).
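
Since later steps need the file path and the anchor separately, a small helper can split the two. This is a sketch; the name `splitHref` and the `HrefParts` type are assumptions made for illustration.

```typescript
interface HrefParts {
  path: string;          // e.g. "chapter1.html"
  anchor: string | null; // e.g. "section2", or null when there is no fragment
}

// Split an NCX href into its file path and optional anchor.
function splitHref(href: string): HrefParts {
  const hashIndex = href.indexOf('#');
  if (hashIndex === -1) return { path: href, anchor: null };
  const anchor = href.substring(hashIndex + 1);
  // An empty fragment ("file.html#") is treated the same as no fragment.
  return { path: href.substring(0, hashIndex), anchor: anchor || null };
}
```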

2. Filtering Non-Book Chapters

Collections often include metadata chapters like copyright, introduction, or table of contents. We filter these out:

// Filter out non-book chapters
const books = tocData.filter(entry => {
  const lowerTitle = entry.label.toLowerCase();
  const skipKeywords = [
    'copyright', 'contents', 'introduction', 'preface',
    'dedication', 'acknowledgments', 'foreword', 'prologue'
  ];
  return !skipKeywords.some(kw => lowerTitle.includes(kw));
});
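
Once the book entries are known, each book's content runs from its own href up to the href of the next book. The sketch below pairs the entries that way; `BookRange` and `computeBookRanges` are hypothetical names, and the approach assumes the TOC entries appear in reading order.

```typescript
interface TocEntry {
  label: string;
  href: string;
}

interface BookRange {
  title: string;
  startHref: string;
  endHref: string | null; // null means "until the end of the spine"
}

// Pair every book entry with the start of the following book.
function computeBookRanges(books: TocEntry[]): BookRange[] {
  return books.map((entry, i) => ({
    title: entry.label,
    startHref: entry.href,
    endHref: books[i + 1]?.href ?? null,
  }));
}
```

One design caveat: if the ranges are computed from the filtered list, a skipped entry that sits between two books (a copyright page, say) ends up inside the preceding book. Computing the end boundary from the unfiltered TOC avoids that.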

3. Extracting Content by Anchor

This is the most critical part. When multiple books share the same HTML file, we use anchors to extract specific sections:

function extractContentByAnchor(
  html: string, 
  startAnchor: string | null, 
  endAnchor: string | null
): string {
  // Find body content
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  if (!bodyMatch) return html;

  const bodyContent = bodyMatch[1];
  let startPos = 0;

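  // Match any element that carries the anchor as an id or name attribute.
  // The anchor is regex-escaped because ids may contain characters such as "." or "+".
  const anchorPattern = (anchor: string): RegExp => {
    const escaped = anchor.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    return new RegExp(`<[a-z][^>]*\\s(?:id|name)=["']${escaped}["'][^>]*>`, 'i');
  };

  // Find start anchor position. The anchor element itself is kept so that
  // tags such as <h1 id="..."> stay balanced in the extracted content.
  if (startAnchor) {
    const match = bodyContent.match(anchorPattern(startAnchor));
    if (match?.index !== undefined) {
      startPos = match.index;
    }
  }

  // Find end anchor position (the next book's anchor element is excluded)
  let endPos = bodyContent.length;
  if (endAnchor) {
    const match = bodyContent.match(anchorPattern(endAnchor));
    if (match?.index !== undefined) {
      endPos = match.index;
    }
  }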

  // Extract content between anchors
  let content = bodyContent.substring(startPos, endPos);

  // Fix incomplete tags at boundaries
  const lastCompleteTag = content.lastIndexOf('>');
  const lastOpeningTag = content.lastIndexOf('<');
  if (lastOpeningTag > lastCompleteTag) {
    content = content.substring(0, lastCompleteTag + 1);
  }

  return content.trim();
}

4. Handling Multi-File Content

Sometimes a book spans multiple HTML files in the spine. We need to concatenate them:

function extractContentAcrossFiles(
  fs: MemoryFileSystem,
  spine: string[],
  opfDir: string,
  startHtmlPath: string,
  startAnchor: string | null,
  endHtmlPath: string,
  endAnchor: string | null
): string {
  let content = '';

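  // Find positions in spine (entries may or may not include a directory)
  const startFile = startHtmlPath.split('/').pop();
  const endFile = endHtmlPath.split('/').pop();
  const isFile = (h: string, file?: string) => h === file || h.endsWith('/' + file);
  const startIndex = spine.findIndex(h => isFile(h, startFile));
  const endIndex = spine.findIndex(h => isFile(h, endFile));
  if (startIndex === -1) return content; // Start file is not in the spine

  // Extract from each file in range; if the end file is missing, read to the end of the spine
  const lastIndex = endIndex === -1 ? spine.length - 1 : endIndex;
  for (let i = startIndex; i <= lastIndex; i++) {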
    const htmlPath = opfDir ? `${opfDir}/${spine[i]}` : spine[i];
    const html = fs.readFileText(htmlPath);

    if (i === startIndex && i === endIndex) {
      // Same file
      content += extractContentByAnchor(html, startAnchor, endAnchor);
    } else if (i === startIndex) {
      // Start file: from anchor to end
      content += extractContentByAnchor(html, startAnchor, null);
    } else if (i === endIndex) {
      // End file: from start to anchor
      content += '\n' + extractContentByAnchor(html, null, endAnchor);
    } else {
      // Middle file: full content
      const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
      if (bodyMatch) content += '\n' + bodyMatch[1].trim();
    }
  }

  return content;
}

5. Resource Collection

We must collect all images and CSS files referenced by the extracted content:

// Collect images
const $ = cheerio.load(bookHtml);
const imagePaths: string[] = [];
const imagesToInclude = new Set<string>();

$('img').each((_j, img) => {
  const src = $(img).attr('src');
  if (src) {
    // Resolve relative paths
    let imgPath: string;
    if (src.startsWith('../')) {
      const parts = htmlDir.split('/');
      parts.pop();
      imgPath = [...parts, src.replace('../', '')].join('/');
    } else {
      imgPath = `${htmlDir}/${src}`;
    }

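    // Record each existing image once, even if it is referenced several times,
    // so the manifest does not end up with duplicate items
    if (fs.fileExists(imgPath) && !imagesToInclude.has(imgPath)) {
      imagesToInclude.add(imgPath);
      imagePaths.push(imgPath);
    }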
  }
});

// Similar logic for CSS files
$('link[rel="stylesheet"]').each((_j, link) => {
  // ... resolve path and collect
});
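
The path handling above only covers a single leading `../`. A more general resolver might look like the sketch below, which could serve both the image and the CSS branch. The name `resolveHref` is hypothetical, and treating a leading `/` as the archive root is an assumption rather than something the original implementation is known to do.

```typescript
// Resolve a resource reference against the directory of the HTML file that contains it.
// Handles "./", any number of "../" segments, fragments, queries, and percent-escapes.
function resolveHref(baseDir: string, href: string): string {
  // Drop any fragment or query string.
  let clean = href.replace(/[#?].*$/, '');
  try {
    clean = decodeURIComponent(clean); // e.g. "my%20image.png" -> "my image.png"
  } catch {
    // Malformed escape sequence: keep the raw value.
  }

  // Assumption: a leading "/" means "relative to the archive root".
  const segments = clean.startsWith('/') ? [] : baseDir.split('/').filter(Boolean);
  for (const part of clean.split('/')) {
    if (part === '' || part === '.') continue;
    if (part === '..') segments.pop();
    else segments.push(part);
  }
  return segments.join('/');
}
```

With `htmlDir` set to `OEBPS/text`, a `src` of `../images/cover.jpg` resolves to `OEBPS/images/cover.jpg`.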

6. Generating Valid EPUB Structure

Each split book needs proper EPUB structure:

import JSZip from 'jszip';

// `fs` is the in-memory file system that holds the unpacked source EPUB.
async function createEpub(
  title: string,
  author: string,
  language: string,
  htmlContent: string,
  images: string[],
  cssFiles: string[]
): Promise<Uint8Array> {
  const bookZip = new JSZip();

  // 1. Mimetype - MUST be first and uncompressed
  bookZip.file('mimetype', 'application/epub+zip', {
    compression: 'STORE'
  });

  // 2. Container XML
  const containerXml = `<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`;
  bookZip.file('META-INF/container.xml', containerXml);

  // 3. Content files
  bookZip.file('OEBPS/content.html', htmlContent);

  // 4. Add resources (images and CSS), flattened into OEBPS/ by file name.
  // References inside htmlContent must point at these same flattened paths.
  const toFileName = (path: string) => path.split('/').pop() ?? path;
  [...images, ...cssFiles].forEach(resourcePath => {
    const data = fs.readFile(resourcePath);
    bookZip.file(`OEBPS/${toFileName(resourcePath)}`, data);
  });

  // 5. Generate OPF and NCX; manifest hrefs must match the paths written above
  const opfContent = generateOPF(
    title, author, language, 'content.html',
    images.map(toFileName), cssFiles.map(toFileName)
  );
  bookZip.file('OEBPS/content.opf', opfContent);

  const ncxContent = generateNCX(title, [{ label: title, href: 'content.html' }]);
  bookZip.file('OEBPS/toc.ncx', ncxContent);

  // 6. Generate final EPUB (the per-file STORE setting on mimetype still applies)
  return await bookZip.generateAsync({
    type: 'uint8array',
    compression: 'DEFLATE',
    compressionOptions: { level: 6 },
    platform: 'UNIX'
  });
}

7. Generating OPF and NCX Files

The OPF (Open Packaging Format) file defines the book's metadata and manifest:

function generateOPF(
  title: string,
  author: string,
  language: string,
  htmlFile: string,
  imagePaths: string[],
  cssPaths: string[]
): string {
  let manifestItems = '';
  let idCounter = 0;

  // NCX (table of contents)
  const ncxId = `ncx${idCounter++}`;
  manifestItems += `    <item id="${ncxId}" href="toc.ncx" media-type="application/x-dtbncx+xml"/>\n`;

  // Main HTML
  const htmlId = `html${idCounter++}`;
  manifestItems += `    <item id="${htmlId}" href="${htmlFile}" media-type="application/xhtml+xml"/>\n`;

  // Images
  imagePaths.forEach((imgPath, idx) => {
    const ext = imgPath.split('.').pop()?.toLowerCase();
    let mediaType = 'image/jpeg';
    if (ext === 'png') mediaType = 'image/png';
    else if (ext === 'gif') mediaType = 'image/gif';

    manifestItems += `    <item id="img${idx}" href="${imgPath}" media-type="${mediaType}"/>\n`;
  });

  // CSS
  cssPaths.forEach((cssPath, idx) => {
    manifestItems += `    <item id="css${idx}" href="${cssPath}" media-type="text/css"/>\n`;
  });

  return `<?xml version="1.0" encoding="UTF-8"?>
<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>${escapeXml(title)}</dc:title>
    <dc:creator>${escapeXml(author)}</dc:creator>
    <dc:language>${language}</dc:language>
    <dc:identifier id="BookId">urn:uuid:${generateUUID()}</dc:identifier>
  </metadata>
  <manifest>
${manifestItems}  </manifest>
  <spine toc="${ncxId}">
    <itemref idref="${htmlId}"/>
  </spine>
</package>`;
}
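
The article's code calls `generateNCX` and `escapeXml` without showing them. What follows is a sketch of how they might be written, not the original implementation. The optional `uid` parameter and its placeholder default are assumptions: in a valid EPUB 2 file the `dtb:uid` value should match the `dc:identifier` in the OPF, so a real caller would pass the same identifier to both generators.

```typescript
interface NavEntry {
  label: string;
  href: string;
}

// Escape the five characters that are special in XML text and attribute values.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build a flat (depth 1) NCX with one navPoint per entry.
// `uid` is a hypothetical extra parameter; it should equal the OPF's dc:identifier.
function generateNCX(title: string, entries: NavEntry[], uid = 'urn:uuid:placeholder'): string {
  const navPoints = entries
    .map((entry, i) => `    <navPoint id="navPoint-${i + 1}" playOrder="${i + 1}">
      <navLabel><text>${escapeXml(entry.label)}</text></navLabel>
      <content src="${escapeXml(entry.href)}"/>
    </navPoint>`)
    .join('\n');

  return `<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta name="dtb:uid" content="${escapeXml(uid)}"/>
    <meta name="dtb:depth" content="1"/>
    <meta name="dtb:totalPageCount" content="0"/>
    <meta name="dtb:maxPageNumber" content="0"/>
  </head>
  <docTitle><text>${escapeXml(title)}</text></docTitle>
  <navMap>
${navPoints}
  </navMap>
</ncx>`;
}
```

Escaping matters here because book titles routinely contain ampersands and quotes, and a single unescaped `&` makes the whole NCX unparseable.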

Browser vs Node.js Implementation

The code can run in both environments:

Node.js (using adm-zip):

Direct filesystem access
Better for batch processing
Can use cheerio for HTML parsing

Browser (using JSZip + Web Workers):

Privacy-first: files never leave the device
Uses Comlink for worker communication
Memory-efficient with streaming

Browser Architecture

Implementation Tips

Anchor Detection: Not all EPUBs use consistent anchor formats. Support both id="anchor" and name="anchor" attributes.
HTML Entities: Always decode HTML entities in titles (&amp;amp;, &amp;#1234;, etc.) so that they display as the characters they stand for (&, Ӓ).
Path Resolution: Handle relative paths correctly (../, ./, absolute)
File Structure: Maintain the original EPUB's directory structure for resources
Mimetype: Always add mimetype as the first file in the ZIP with compression: 'STORE'
Error Handling: Some collections may have malformed NCX files. Provide fallback extraction strategies.
Memory Management: For large collections, process books sequentially rather than loading everything into memory.

Conclusion

Splitting EPUB collections requires understanding the EPUB specification, particularly the NCX navigation structure and how content is organized across HTML files. The key challenges are:

Parsing the table of contents correctly
Extracting content by anchors when multiple books share HTML files
Collecting all referenced resources (images, CSS)
Generating valid EPUB metadata files

With the techniques described in this article, you can build robust EPUB splitting tools that work both in browsers for privacy and on servers for batch processing.

The complete implementation is available as an open-source web tool that runs entirely in your browser - no files are ever uploaded to any server, ensuring complete privacy for your ebook collection.

Try It Now

Ready to split your EPUB collections? Try our free online EPUB splitter - it works entirely in your browser with complete privacy protection:

Free EPUB Splitter - Split Collections into Individual Books

Features:

Split EPUB collections into individual books automatically
100% browser-based - no files uploaded to any server
Preserves formatting, images, and metadata
Download all split books as a ZIP file
Completely free to use

Try it yourself: The code examples in this article are based on a real implementation. You can adapt them for your own EPUB processing needs, whether building a command-line tool, a web application, or integrating into a larger ebook management system.
