
monkeymore studio

How to Split an EPUB Collection into Individual Books

Have you ever downloaded an ebook collection only to find it's a single massive EPUB file containing multiple books? This is especially common with translated works where publishers bundle entire series or author collections into one file. In this article, I'll explain how to split these collections into individual EPUB files, complete with code examples and technical details.

Why Split EPUB Collections?

Here are some common scenarios where splitting EPUBs becomes essential:

1. Collection Bundles

  • Publisher bundles like "The Complete Works of [Author]" containing 10+ books
  • Box sets sold as single files (e.g., "The Complete Sherlock Holmes Collection [10 Books Box Set].epub")
  • Series compilations where each book should be separate

2. Reading Experience

  • Smaller files are easier on e-readers with limited storage
  • Better organization in your digital library
  • Easier to sync specific books across devices
  • Faster loading times on older devices

3. Content Management

  • Selectively share individual books
  • Remove unwanted titles from a collection
  • Reorganize chapters or books in a different order

4. Academic Use

  • Extract specific volumes for citation
  • Distribute individual readings to students
  • Create custom anthologies

Understanding EPUB Structure

Before diving into the code, let's understand how EPUB files work:

EPUB File (ZIP archive)
├── mimetype
├── META-INF/
│   └── container.xml
└── OEBPS/
    ├── content.opf
    ├── toc.ncx
    ├── chapter1.html
    ├── chapter2.html
    └── images/

Key files:

  • mimetype: Must be the first file, uncompressed, containing application/epub+zip
  • container.xml: Points to the OPF file location
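  • content.opf: Contains the metadata, the manifest of all resources, and the spine (reading order)
  • toc.ncx: The EPUB 2 table of contents with navigation points (navPoints); EPUB 3 files use an XHTML navigation document instead, often alongside an NCX for compatibility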
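
Since container.xml is the entry point, a splitter has to read it before anything else. Here is a minimal sketch of that lookup. The names findOpfPath and dirOf are hypothetical (they are not from the tool described here), and the regex is a shortcut that assumes a single, well-formed rootfile element; a real tool would use an XML parser.

```typescript
// Hypothetical helper: extract the OPF path from META-INF/container.xml.
// Assumes one well-formed <rootfile> element; use an XML parser in production.
function findOpfPath(containerXml: string): string | null {
  const match = containerXml.match(/<rootfile\b[^>]*\sfull-path=["']([^"']+)["']/i);
  return match ? match[1] : null;
}

// Directory that OPF-relative hrefs resolve against ('' when the OPF is at the root).
function dirOf(path: string): string {
  const slash = path.lastIndexOf('/');
  return slash === -1 ? '' : path.substring(0, slash);
}

const sampleContainer = `<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`;

const opfPath = findOpfPath(sampleContainer); // "OEBPS/content.opf"
const opfDir = opfPath ? dirOf(opfPath) : ''; // "OEBPS"
```

The directory part matters because every href in the OPF manifest is relative to the OPF file, not to the root of the archive.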

The Splitting Process

Here's how the splitting algorithm works:

[Figure: EPUB splitting process flowchart]

Key Technical Details

1. Parsing the Table of Contents

The NCX (Navigation Control file for XML) is where the magic happens. It contains navPoints that define the book structure:

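import * as cheerio from 'cheerio';

interface TocEntry {
  label: string; // decoded navLabel text
  href: string;  // decoded content src, e.g. "chapter1.html#section2"
}

// Parse NCX to find book boundaries
// (decodeHtmlEntities is a helper defined elsewhere)
function parseNCX(ncxXml: string): TocEntry[] {
  const tocData: TocEntry[] = [];
  const $ = cheerio.load(ncxXml, { xmlMode: true });

  // Note: this visits every navPoint, including nested ones
  $('navPoint').each((_i, elem) => {
    // The first <text> inside a navPoint is its own label
    const textMatch = $(elem).html()?.match(/<text>([^<]+)<\/text>/i);
    let label = textMatch ? textMatch[1].trim() : '';
    label = decodeHtmlEntities(label); // Handle &amp;, &#1234;, etc.

    let href = $('content', elem).first().attr('src') || '';
    try {
      href = decodeURIComponent(href);
    } catch {
      // Malformed percent-escape: keep the raw href
    }

    if (label && href) {
      tocData.push({ label, href });
    }
  });

  return tocData;
}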

Key insight: The href contains both the HTML file path and an optional anchor (e.g., chapter1.html#section2).
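
To make that concrete, here is a small sketch of splitting an href into its two parts. HrefParts and splitHref are hypothetical names introduced for illustration; the later extraction steps only need the path and the anchor as separate values.

```typescript
interface HrefParts {
  path: string;          // HTML file path, relative to the NCX
  anchor: string | null; // fragment identifier without '#', or null when absent
}

// Hypothetical helper: separate "chapter1.html#section2" into path and anchor.
function splitHref(href: string): HrefParts {
  const hashIndex = href.indexOf('#');
  if (hashIndex === -1) return { path: href, anchor: null };
  const anchor = href.substring(hashIndex + 1);
  return { path: href.substring(0, hashIndex), anchor: anchor === '' ? null : anchor };
}

const withAnchor = splitHref('chapter1.html#section2');
// withAnchor.path === 'chapter1.html', withAnchor.anchor === 'section2'
const withoutAnchor = splitHref('chapter2.html');
// withoutAnchor.path === 'chapter2.html', withoutAnchor.anchor === null
```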

2. Filtering Non-Book Chapters

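Collections often include front-matter entries such as a copyright page, an introduction, or a table of contents. We filter these out by title. The match is a plain substring test, so a book titled "Introduction to Algorithms" would be skipped too; adjust the keyword list to suit your collection:

// Keywords that mark front-matter entries rather than books
const skipKeywords = [
  'copyright', 'contents', 'introduction', 'preface',
  'dedication', 'acknowledgments', 'foreword', 'prologue'
];

// Filter out non-book chapters
const books = tocData.filter(entry => {
  const lowerTitle = entry.label.toLowerCase();
  return !skipKeywords.some(kw => lowerTitle.includes(kw));
});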
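
Once the list contains only books, each book's content runs from its own TOC entry up to the entry of the next book. The sketch below turns the filtered list into start/end pairs. BookRange and toRanges are hypothetical names, and the sketch assumes the entries are already in reading order.

```typescript
interface TocEntry { label: string; href: string; }

interface BookRange {
  title: string;
  startPath: string;
  startAnchor: string | null;
  endPath: string | null;   // null: the book runs to the end of the spine
  endAnchor: string | null;
}

// Split "file.html#anchor" into [path, anchor-or-null].
function pathAndAnchor(href: string): [string, string | null] {
  const i = href.indexOf('#');
  return i === -1 ? [href, null] : [href.substring(0, i), href.substring(i + 1) || null];
}

// Hypothetical helper: each book ends where the next one starts.
function toRanges(books: TocEntry[]): BookRange[] {
  return books.map((book, i) => {
    const start = pathAndAnchor(book.href);
    const next = books[i + 1];
    const end: [string | null, string | null] = next ? pathAndAnchor(next.href) : [null, null];
    return {
      title: book.label,
      startPath: start[0],
      startAnchor: start[1],
      endPath: end[0],
      endAnchor: end[1],
    };
  });
}
```

One case needs care: when the next book begins at the top of a file (no anchor), that end file belongs entirely to the next book, so the caller would stop one spine item earlier rather than include it.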

3. Extracting Content by Anchor

This is the most critical part. When multiple books share the same HTML file, we use anchors to extract specific sections:

function extractContentByAnchor(
  html: string, 
  startAnchor: string | null, 
  endAnchor: string | null
): string {
  // Find body content
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  if (!bodyMatch) return html;

  const bodyContent = bodyMatch[1];
  let startPos = 0;

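  // Escape regex metacharacters so anchors such as "ch.1" match literally
  const escapeRegExp = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Match any element carrying the anchor as id or name (not only <a>)
  const anchorTag = (anchor: string) =>
    new RegExp(`<[a-z][^>]*\\s(?:id|name)=["']${escapeRegExp(anchor)}["'][^>]*>`, 'i');

  // Find start anchor position; start at the tag itself so that its
  // closing tag is not left without an opener
  if (startAnchor) {
    const match = bodyContent.match(anchorTag(startAnchor));
    if (match?.index !== undefined) {
      startPos = match.index;
    }
  }

  // Find end anchor position
  let endPos = bodyContent.length;
  if (endAnchor) {
    const match = bodyContent.match(anchorTag(endAnchor));
    if (match?.index !== undefined) {
      endPos = match.index;
    }
  }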

  // Extract content between anchors
  let content = bodyContent.substring(startPos, endPos);

  // Fix incomplete tags at boundaries
  const lastCompleteTag = content.lastIndexOf('>');
  const lastOpeningTag = content.lastIndexOf('<');
  if (lastOpeningTag > lastCompleteTag) {
    content = content.substring(0, lastCompleteTag + 1);
  }

  return content.trim();
}

4. Handling Multi-File Content

Sometimes a book spans multiple HTML files in the spine. We need to concatenate them:

function extractContentAcrossFiles(
  fs: MemoryFileSystem,
  spine: string[],
  opfDir: string,
  startHtmlPath: string,
  startAnchor: string | null,
  endHtmlPath: string,
  endAnchor: string | null
): string {
  let content = '';

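  // Find positions in spine (entries may or may not include a directory)
  const startFile = startHtmlPath.split('/').pop();
  const endFile = endHtmlPath.split('/').pop();
  const matches = (h: string, file?: string) => h === file || h.endsWith('/' + file);
  const startIndex = spine.findIndex(h => matches(h, startFile));
  const endIndex = spine.findIndex(h => matches(h, endFile));
  if (startIndex === -1) return content; // start file is not in the spine

  // Extract from each file in range
  const lastIndex = endIndex === -1 ? spine.length - 1 : endIndex;
  for (let i = startIndex; i <= lastIndex; i++) {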
    const htmlPath = opfDir ? `${opfDir}/${spine[i]}` : spine[i];
    const html = fs.readFileText(htmlPath);

    if (i === startIndex && i === endIndex) {
      // Same file
      content += extractContentByAnchor(html, startAnchor, endAnchor);
    } else if (i === startIndex) {
      // Start file: from anchor to end
      content += extractContentByAnchor(html, startAnchor, null);
    } else if (i === endIndex) {
      // End file: from start to anchor
      content += '\n' + extractContentByAnchor(html, null, endAnchor);
    } else {
      // Middle file: full content
      const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
      if (bodyMatch) content += '\n' + bodyMatch[1].trim();
    }
  }

  return content;
}
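
The spine array passed in above has to come from the OPF file: each itemref in the spine points at a manifest item by id, and that item carries the href. Here is a minimal sketch of that lookup. readAttr and parseSpine are hypothetical names, and the regex-based attribute reading is an assumption that holds for well-formed OPF files; an XML parser is the safer choice.

```typescript
// Read one attribute from a single tag string; null when absent.
function readAttr(tag: string, attrName: string): string | null {
  const m = tag.match(new RegExp('\\s' + attrName + '=["\']([^"\']*)["\']', 'i'));
  return m ? m[1] : null;
}

// Hypothetical helper: list spine documents in reading order as manifest hrefs.
function parseSpine(opfXml: string): string[] {
  // Map manifest ids to hrefs.
  const hrefById: { [id: string]: string } = {};
  const itemTags = opfXml.match(/<item\b[^>]*>/gi) || [];
  for (let i = 0; i < itemTags.length; i++) {
    const id = readAttr(itemTags[i], 'id');
    const href = readAttr(itemTags[i], 'href');
    if (id !== null && href !== null) hrefById[id] = href;
  }

  // Resolve each itemref through the manifest, skipping dangling references.
  const spine: string[] = [];
  const refTags = opfXml.match(/<itemref\b[^>]*>/gi) || [];
  for (let i = 0; i < refTags.length; i++) {
    const idref = readAttr(refTags[i], 'idref');
    if (idref !== null && Object.prototype.hasOwnProperty.call(hrefById, idref)) {
      spine.push(hrefById[idref]);
    }
  }
  return spine;
}
```

The hrefs come back exactly as written in the OPF. Since parseNCX decodes percent-escapes in its hrefs, the spine entries would need the same decoding before the two are compared.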

5. Resource Collection

We must collect all images and CSS files referenced by the extracted content:

// Collect images
const $ = cheerio.load(bookHtml);
const imagePaths: string[] = [];
const imagesToInclude = new Set<string>();

$('img').each((_j, img) => {
  const src = $(img).attr('src');
  if (src) {
    // Resolve relative paths, walking up one directory per leading "../"
    const parts = htmlDir ? htmlDir.split('/') : [];
    let rest = src.replace(/^\.\//, '');
    while (rest.startsWith('../')) {
      parts.pop();
      rest = rest.substring(3);
    }
    const imgPath = [...parts, rest].join('/');

    // Record each image once, even if it is referenced several times
    if (fs.fileExists(imgPath) && !imagesToInclude.has(imgPath)) {
      imagesToInclude.add(imgPath);
      imagePaths.push(imgPath);
    }
  }
});

// Similar logic for CSS files
$('link[rel="stylesheet"]').each((_j, link) => {
  // ... resolve path and collect
});
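
Images and stylesheets need the same path handling, so it can help to put it in one place. The sketch below is one possible shared resolver, not the tool's actual code: resolveHref is a hypothetical name, and the decision to skip anything with a URL scheme (http:, data:, and so on) is an assumption about what should count as a file inside the EPUB.

```typescript
// Hypothetical helper: resolve a src/href found in an HTML file against the
// directory of that file. Returns null for references that are not EPUB files.
function resolveHref(baseDir: string, href: string): string | null {
  // Skip external and inline resources such as http://... or data:...
  if (/^[a-z][a-z0-9+.-]*:/i.test(href)) return null;

  // Drop any fragment or query, then decode percent-escapes such as %20.
  let path = href.split('#')[0].split('?')[0];
  try {
    path = decodeURIComponent(path);
  } catch (_err) {
    // Malformed escape: keep the raw path
  }
  if (path === '') return null;

  // A leading "/" resolves from the archive root, anything else from baseDir.
  const segments = path.charAt(0) === '/' || baseDir === '' ? [] : baseDir.split('/');
  const pieces = path.split('/');
  for (let i = 0; i < pieces.length; i++) {
    if (pieces[i] === '' || pieces[i] === '.') continue;
    if (pieces[i] === '..') segments.pop();
    else segments.push(pieces[i]);
  }
  return segments.join('/');
}

const coverPath = resolveHref('OEBPS/text', '../images/cover.jpg'); // "OEBPS/images/cover.jpg"
const stylePath = resolveHref('OEBPS', './styles/main.css');         // "OEBPS/styles/main.css"
```

Returning null for external references lets the caller leave those attributes untouched instead of looking for a file that is not in the archive.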

6. Generating Valid EPUB Structure

Each split book needs a proper EPUB structure of its own:

async function createEpub(
  title: string,
  author: string,
  language: string,
  htmlContent: string,
  images: string[],
  cssFiles: string[]
): Promise<Uint8Array> {
  const bookZip = new JSZip();

  // 1. Mimetype - MUST be first and uncompressed
  bookZip.file('mimetype', 'application/epub+zip', { 
    compression: 'STORE' 
  });

  // 2. Container XML
  const containerXml = `<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`;
  bookZip.file('META-INF/container.xml', containerXml);

  // 3. Content files
  bookZip.file('OEBPS/content.html', htmlContent);

  // 4. Add resources (images and stylesheets), stored flat under OEBPS/.
  //    `fs` is the in-memory file system that holds the source EPUB.
  //    References inside htmlContent must use the same flat file names.
  const baseName = (p: string) => p.split('/').pop() as string;
  [...images, ...cssFiles].forEach(resPath => {
    bookZip.file(`OEBPS/${baseName(resPath)}`, fs.readFile(resPath));
  });

  // 5. Generate OPF and NCX; manifest hrefs must match the ZIP entry names
  const opfContent = generateOPF(
    title, author, language, 'content.html',
    images.map(baseName), cssFiles.map(baseName)
  );
  bookZip.file('OEBPS/content.opf', opfContent);

  const ncxContent = generateNCX(title, [{ label: title, href: 'content.html' }]);
  bookZip.file('OEBPS/toc.ncx', ncxContent);

  // 6. Generate final EPUB
  return await bookZip.generateAsync({
    type: 'uint8array',
    compression: 'DEFLATE',
    compressionOptions: { level: 6 },
    platform: 'UNIX'
  });
}
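
One thing to watch: the extraction functions earlier return only the inside of the body element, while an EPUB content file has to be a complete XHTML document. If htmlContent is such a fragment, it needs wrapping before it goes into the ZIP. The sketch below shows one way to do that; wrapXhtml and escapeXmlText are hypothetical names, and the XHTML 1.1 doctype is an assumption chosen to match the EPUB 2 package this article generates.

```typescript
// Escape text for use in XML element content and attribute values.
function escapeXmlText(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: wrap a body fragment in a complete XHTML 1.1 document.
// cssHrefs are paths relative to the content file, e.g. "style.css".
function wrapXhtml(title: string, bodyHtml: string, cssHrefs: string[] = []): string {
  const links = cssHrefs
    .map(href => `    <link rel="stylesheet" type="text/css" href="${escapeXmlText(href)}"/>\n`)
    .join('');
  return `<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>${escapeXmlText(title)}</title>
${links}  </head>
  <body>
${bodyHtml}
  </body>
</html>`;
}
```

The body fragment is inserted as-is, so it still has to be well-formed XHTML on its own; the wrapper does not repair unclosed tags left over from cutting at an anchor.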

7. Generating OPF and NCX Files

The OPF (Open Package Format) file defines the book's metadata and manifest:

function generateOPF(
  title: string,
  author: string,
  language: string,
  htmlFile: string,
  imagePaths: string[],
  cssPaths: string[]
): string {
  let manifestItems = '';
  let idCounter = 0;

  // NCX (table of contents)
  const ncxId = `ncx${idCounter++}`;
  manifestItems += `    <item id="${ncxId}" href="toc.ncx" media-type="application/x-dtbncx+xml"/>\n`;

  // Main HTML
  const htmlId = `html${idCounter++}`;
  manifestItems += `    <item id="${htmlId}" href="${htmlFile}" media-type="application/xhtml+xml"/>\n`;

  // Images
  imagePaths.forEach((imgPath, idx) => {
    const ext = imgPath.split('.').pop()?.toLowerCase();
    let mediaType = 'image/jpeg';
    if (ext === 'png') mediaType = 'image/png';
    else if (ext === 'gif') mediaType = 'image/gif';

    manifestItems += `    <item id="img${idx}" href="${imgPath}" media-type="${mediaType}"/>\n`;
  });

  // CSS
  cssPaths.forEach((cssPath, idx) => {
    manifestItems += `    <item id="css${idx}" href="${cssPath}" media-type="text/css"/>\n`;
  });

  return `<?xml version="1.0" encoding="UTF-8"?>
<package version="2.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>${escapeXml(title)}</dc:title>
    <dc:creator>${escapeXml(author)}</dc:creator>
    <dc:language>${language}</dc:language>
    <dc:identifier id="BookId">urn:uuid:${generateUUID()}</dc:identifier>
  </metadata>
  <manifest>
${manifestItems}  </manifest>
  <spine toc="${ncxId}">
    <itemref idref="${htmlId}"/>
  </spine>
</package>`;
}
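
The NCX side, which createEpub calls as generateNCX, is not shown above, so here is a minimal sketch of what it could look like. This is an illustration, not the tool's actual code: the escapeXml body and the optional uid parameter are assumptions. The dtb:uid value is meant to match the dc:identifier in the OPF, which means the UUID would have to be generated once and passed to both generators.

```typescript
interface NavEntry { label: string; href: string; }

// Assumed implementation of the escapeXml helper used by generateOPF.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Sketch of an EPUB 2 NCX generator with a flat (depth 1) table of contents.
// `uid` is a hypothetical extra parameter; it should equal the OPF identifier.
function generateNCX(title: string, entries: NavEntry[], uid: string = ''): string {
  const navPoints = entries
    .map((entry, i) => `    <navPoint id="navPoint-${i + 1}" playOrder="${i + 1}">
      <navLabel><text>${escapeXml(entry.label)}</text></navLabel>
      <content src="${escapeXml(entry.href)}"/>
    </navPoint>`)
    .join('\n');

  return `<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta name="dtb:uid" content="${escapeXml(uid)}"/>
    <meta name="dtb:depth" content="1"/>
    <meta name="dtb:totalPageCount" content="0"/>
    <meta name="dtb:maxPageNumber" content="0"/>
  </head>
  <docTitle><text>${escapeXml(title)}</text></docTitle>
  <navMap>
${navPoints}
  </navMap>
</ncx>`;
}
```

With a single content file the table of contents has just one entry. If you keep chapter-level navPoints from the original NCX, they can be passed in as additional entries.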

Browser vs Node.js Implementation

The code can run in both environments:

Node.js (using adm-zip):

  • Direct filesystem access
  • Better for batch processing
  • Can use cheerio for HTML parsing

Browser (using JSZip + Web Workers):

  • Privacy-first: files never leave the device
  • Uses Comlink for worker communication
  • Memory-efficient with streaming

Browser Architecture

[Figure: browser architecture sequence diagram]

Implementation Tips

  1. Anchor Detection: Not all EPUBs use consistent anchor formats. Support both id="anchor" and name="anchor" attributes.

  2. HTML Entities: Always decode HTML entities in titles (&amp;, &#1234;, etc.).

  3. Path Resolution: Handle relative paths correctly (../, ./, absolute).

  4. File Structure: Maintain the original EPUB's directory structure for resources, or rewrite the src and href attributes in the content to match the new layout.

  5. Mimetype: Always add mimetype as the first file in the ZIP with compression: 'STORE'.

  6. Error Handling: Some collections may have malformed NCX files. Provide fallback extraction strategies.

  7. Memory Management: For large collections, process books sequentially rather than loading everything into memory.
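
For tip 2, the parsing code earlier calls a decodeHtmlEntities helper without showing it. Below is a minimal sketch of such a helper, given as an assumption rather than the tool's actual code. It handles decimal and hexadecimal numeric references plus a small table of named entities; NAMED_ENTITIES and codePointToString are hypothetical names, and a full implementation would cover the complete HTML entity list.

```typescript
// A deliberately small table; extend it for the entities your books use.
const NAMED_ENTITIES: { [name: string]: string } = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00A0',
};

// Build a string from a code point, using a surrogate pair above U+FFFF.
function codePointToString(code: number): string {
  if (code <= 0xffff) return String.fromCharCode(code);
  const offset = code - 0x10000;
  return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
}

// Decode &amp;, &#1234;, &#x4D2; and similar; unknown entities are left as-is.
function decodeHtmlEntities(text: string): string {
  return text.replace(
    /&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi,
    (whole: string, body: string): string => {
      if (body.charAt(0) === '#') {
        const isHex = body.charAt(1).toLowerCase() === 'x';
        const code = parseInt(body.substring(isHex ? 2 : 1), isHex ? 16 : 10);
        return code > 0 && code <= 0x10ffff ? codePointToString(code) : whole;
      }
      return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
        ? NAMED_ENTITIES[body]
        : whole;
    }
  );
}

const decodedTitle = decodeHtmlEntities('Pride &amp; Prejudice'); // "Pride & Prejudice"
```

The replacement runs in a single pass, so a double-escaped title such as `&amp;lt;` decodes to `&lt;` rather than all the way to `<`.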

Conclusion

Splitting EPUB collections requires understanding the EPUB specification, particularly the NCX navigation structure and how content is organized across HTML files. The key challenges are:

  • Parsing the table of contents correctly
  • Extracting content by anchors when multiple books share HTML files
  • Collecting all referenced resources (images, CSS)
  • Generating valid EPUB metadata files

With the techniques described in this article, you can build robust EPUB splitting tools that work both in browsers for privacy and on servers for batch processing.

The complete implementation is available as an open-source web tool that runs entirely in your browser - no files are ever uploaded to any server, ensuring complete privacy for your ebook collection.


Try It Now

Ready to split your EPUB collections? Try our free online EPUB splitter - it works entirely in your browser with complete privacy protection:

Free EPUB Splitter - Split Collections into Individual Books

Features:

  • Split EPUB collections into individual books automatically
  • 100% browser-based - no files uploaded to any server
  • Preserves formatting, images, and metadata
  • Download all split books as a ZIP file
  • Completely free to use

Try it yourself: The code examples in this article are based on a real implementation. You can adapt them for your own EPUB processing needs, whether building a command-line tool, a web application, or integrating into a larger ebook management system.
