Fit Happens ML

Building a Custom FB2 Converter for LLMs

TL;DR:
I "vibecoded" a custom Python tool to convert legacy .fb2 e-books into structured Markdown (perfect for LLMs/RAG/Obsidian) and Plain Text. No ads, no bloat, just 150 lines of code.

  • The App: Live Streamlit Demo
  • The Code:

    FitHappensML / fb2-to-md-converter

    A lightweight CLI and Web tool to convert FB2 books to Markdown or Plain Text. Includes a built-in Streamlit reader and smart formatting support.

    FB2 to TXT/MD Converter & Reader

    🚀 Try the Live Demo: fb2-to-md-converter.streamlit.app

    ✍️ Read the Story behind the code: Vibecoding Your Way Out of Format Hell (Medium)

    This project is a Python-based utility for converting .fb2 (FictionBook) files into .txt (plain text) or .md (Markdown). The tool offers two interfaces: a user-friendly web UI built with Streamlit for reading and converting, and a command-line interface (CLI) for fast processing and automation.

    ✨ Key Features

    • Dual Interfaces
      • 🎨 Web UI (Streamlit): Upload your files, read books directly in the browser, and download the result in your desired format.
      • βš™οΈ Command-Line Interface (CLI): Quickly convert files from your terminal, perfect for scripting and batch processing.
    • Smart Formatting: An option to convert FB2 tags (like subtitles and emphasis) into corresponding Markdown syntax.
    • Dual Export Formats: Save your books as clean .txt or as formatted .md files.
    • Built-in Reader…

Hello, fellow builders!

If you're anything like me, you probably have a digital hoard of books. In my case, it's a massive collection of .fb2 (FictionBook) files. Solid format, XML-based, widely supported... until you need to feed it into a Large Language Model (LLM).

Here's the problem: LLMs eat text, not XML tags.

I needed to convert my entire library into clean, structured Markdown. I needed headers to actually be headers (### Chapter 1), and emphasis to be italics (*wow*), so the model understands the semantic structure of the narrative.

I looked at existing tools.

  • The Desktop Apps: Bloated, require installation, often Windows-only 90s relics.
  • The Online Converters: "Upload clean_code.fb2... waiting... Download your file after watching this 30s ad". No thanks.
  • The Scripts: Most just strip all tags blindly, turning a beautiful dialogue into a wall of text.

If you are a developer in 2026, you don't hunt for software. You vibecode it.

It is faster to tailor a bespoke suit of a script than to shop for ill-fitting off-the-rack solutions. Plus, when you build it, you own the pipeline.

So, I built my own FB2 to Markdown converter. It has a CLI for batch processing and a Streamlit UI because sometimes I just want to read a chapter in the browser.

Here is how I did it, and how you can do it too.

The Strategy: FB2 is just XML

Don't overcomplicate it. An FB2 file is just an XML file with a specific schema. We don't need a heavy e-book library; we need BeautifulSoup.

Here is the core logic. We parse the XML, look for specific tags (<subtitle>, <emphasis>), and map them to Markdown.
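To see the "just XML" claim in practice, here is a minimal sketch that lists the tags inside a tiny hand-written FB2 document. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so it runs with no dependencies; SAMPLE_FB2, local_name and list_body_tags are illustrative names, not part of the project.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written FB2 document (illustrative sample, not a real book)
SAMPLE_FB2 = """<?xml version="1.0" encoding="UTF-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <description><title-info><book-title>Demo</book-title></title-info></description>
  <body>
    <section>
      <subtitle>Chapter 1</subtitle>
      <p>It was <emphasis>wow</emphasis>.</p>
      <empty-line/>
    </section>
  </body>
</FictionBook>"""


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts on namespaced tags."""
    return tag.rsplit('}', 1)[-1]


def list_body_tags(fb2_xml: str) -> list[str]:
    """Return the tag names found in the first <body>, in document order."""
    root = ET.fromstring(fb2_xml.encode('utf-8'))
    body = next(el for el in root.iter() if local_name(el.tag) == 'body')
    return [local_name(el.tag) for el in body.iter()]


print(list_body_tags(SAMPLE_FB2))
```

Everything the converter cares about (subtitle, p, emphasis, empty-line) shows up as an ordinary XML element, which is why a general-purpose parser is enough.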

The Core Converter

I created a converter.py. The trick is handling nested tags. A paragraph <p> might contain <emphasis> inside it.

from bs4 import BeautifulSoup
from bs4.element import Tag

def _get_formatted_text(tag: Tag) -> str:
    """Recursively process child tags so <emphasis> survives as Markdown italics."""
    parts = []
    for item in tag.children:
        if isinstance(item, Tag):
            if item.name == 'emphasis':
                # get_text(strip=True) would glue the pieces of nested tags
                # together without spaces, so strip the joined text instead
                parts.append(f'*{item.get_text().strip()}*')
            else:
                # Any other tag: keep descending so nested emphasis is found
                parts.append(_get_formatted_text(item))
        else:
            # Plain text between tags
            parts.append(str(item))
    return "".join(parts)

This recursive function is the secret sauce. Calling get_text() on the whole paragraph would flatten everything; walking the children instead preserves the vibe of the text.
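The difference between flattening and walking the tree is easy to check on a single paragraph. The sketch below is a dependency-free illustration using xml.etree.ElementTree rather than BeautifulSoup; inline_markdown and INLINE_MARKERS are hypothetical names, and the <strong> mapping is an assumption added for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from FB2 inline tags to Markdown markers
INLINE_MARKERS = {'emphasis': '*', 'strong': '**'}


def inline_markdown(element: ET.Element) -> str:
    """Rebuild an element's text, wrapping known inline tags in Markdown markers."""
    parts = [element.text or '']
    for child in element:
        inner = inline_markdown(child)       # recurse so nested tags survive
        tag = child.tag.rsplit('}', 1)[-1]   # drop any '{namespace}' prefix
        marker = INLINE_MARKERS.get(tag, '')
        parts.append(f'{marker}{inner.strip()}{marker}' if marker else inner)
        parts.append(child.tail or '')       # text that follows the child tag
    return ''.join(parts)


paragraph = ET.fromstring('<p>It was <emphasis>wow</emphasis>, she said.</p>')
flattened = ''.join(paragraph.itertext())    # tags dropped, structure lost
formatted = inline_markdown(paragraph)       # emphasis kept as *italics*
print(flattened)
print(formatted)
```

The flattened string loses the emphasis entirely, while the recursive version keeps it as Markdown the model can see.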

Then, the main loop allows us to choose between "Raw Text" and "Smart Formatting":

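def convert_fb2_to_txt(fb2_content: str, smart_formatting: bool = False) -> str:
    # The 'lxml-xml' parser requires the lxml package to be installed
    soup = BeautifulSoup(fb2_content, 'lxml-xml')
    text_parts = []

    # Extract Metadata (Title, Author)
    description = soup.find('description')
    # ... extraction logic ...

    # Extract Body
    body = soup.find('body')
    if body is None:
        # No <body> element: nothing further to convert
        return "".join(text_parts)

    for element in body.find_all(['p', 'subtitle', 'empty-line']):
        if element.name == 'p':
            if smart_formatting:
                text_parts.append(_get_formatted_text(element).strip() + '\n\n')
            else:
                # get_text(strip=True) would join "a <emphasis>b</emphasis> c"
                # as "abc", so strip the joined text instead
                text_parts.append(element.get_text().strip() + '\n\n')
        elif element.name == 'subtitle':
            # Boom: Semantic headers for the LLM (Markdown output only)
            prefix = '### ' if smart_formatting else ''
            text_parts.append(f"{prefix}{element.get_text().strip()}\n\n")
        elif element.name == 'empty-line':
            # Keep the author's scene break as an extra blank line
            text_parts.append('\n')

    return "".join(text_parts)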

The Interface: Streamlit for Instant Gratification

I love the CLI, but sometimes I want to drag-and-drop. Streamlit is perfect for this. It takes 5 minutes to build a UI that looks decent.

In app.py, I process the file and offer two flavors of download:

import streamlit as st
from converter import convert_fb2_to_txt

st.title("📖 FB2 Reader & Converter")

uploaded_file = st.file_uploader("Upload .fb2", type=['fb2'])

if uploaded_file is not None:
    # Read the bytes and decode them. This assumes UTF-8; a file that declares
    # another encoding in its XML header can raise UnicodeDecodeError here.
    content = uploaded_file.getvalue().decode('utf-8')

    # Dual Conversion
    plain_text = convert_fb2_to_txt(content, smart_formatting=False)
    markdown_text = convert_fb2_to_txt(content, smart_formatting=True)

    # The Reader View
    st.markdown(markdown_text)

    # Sidebar Downloads
    st.sidebar.download_button("Download .md", markdown_text, file_name="book.md")
    st.sidebar.download_button("Download .txt", plain_text, file_name="book.txt")

This gives me immediate visual verification. I can see if the <subtitle> tags are actually rendering as headers before I commit to converting my whole library.
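One thing the upload path glosses over is text encoding: decoding the uploaded bytes as UTF-8 only works when the file really is UTF-8, and some older FB2 files declare a different encoding such as windows-1251. Here is a minimal sketch of a helper that reads the encoding from the XML declaration before decoding. decode_fb2 is a hypothetical function, not part of the project, and it does not try to handle UTF-16 files.

```python
import re

# Matches the encoding attribute of a leading XML declaration, e.g.
# <?xml version="1.0" encoding="windows-1251"?>
_XML_DECL = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._\-]+)["\']')

UTF8_BOM = b'\xef\xbb\xbf'


def decode_fb2(raw: bytes, default: str = 'utf-8') -> str:
    """Decode FB2 bytes using the encoding named in the XML declaration, if any."""
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    # The declaration, when present, sits at the very start of the file
    match = _XML_DECL.match(raw[:200].lstrip())
    encoding = match.group(1).decode('ascii') if match else default
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not match the declaration:
        # fall back to the default and replace undecodable bytes
        return raw.decode(default, errors='replace')
```

In app.py this could stand in for the direct decode call, for example `content = decode_fb2(uploaded_file.getvalue())`. Passing the raw bytes straight to BeautifulSoup is another option, since it performs its own encoding detection.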

The CLI: For the Serious Batching

Finally, cli.py. Because I'm not going to drag-and-drop 500 books.

import argparse
from converter import convert_fb2_to_txt

# ... setup format args ...

if args.format == 'md':
    result = convert_fb2_to_txt(content, smart_formatting=True)
else:
    result = convert_fb2_to_txt(content, smart_formatting=False)

Now I can just run:
python cli.py "War_and_Peace.fb2" -f md
And get a perfect Markdown file ready for RAG (Retrieval-Augmented Generation) or fine-tuning.
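The argument setup is elided in the snippet above, so here is a minimal sketch of how it might look. Only the positional input file and the -f flag come from the command shown; the --format long option, the txt default, writing the result next to the input, and the names build_parser, output_path and main are all assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for: python cli.py BOOK.fb2 -f md"""
    parser = argparse.ArgumentParser(
        description="Convert an FB2 file to Markdown or plain text."
    )
    parser.add_argument("input", type=Path, help="path to the .fb2 file")
    parser.add_argument(
        "-f", "--format", choices=["txt", "md"], default="txt",
        help="output format (default: txt; the default is an assumption)",
    )
    return parser


def output_path(input_path: Path, fmt: str) -> Path:
    """Assumed convention: write next to the input, swapping the extension."""
    return input_path.with_suffix(f".{fmt}")


def main(argv=None, convert=None) -> None:
    args = build_parser().parse_args(argv)
    if convert is None:
        # Imported lazily so the parser can be used without converter.py
        from converter import convert_fb2_to_txt as convert
    content = args.input.read_text(encoding="utf-8")
    result = convert(content, smart_formatting=(args.format == "md"))
    output_path(args.input, args.format).write_text(result, encoding="utf-8")
```

Calling main() under the usual `if __name__ == "__main__":` guard turns this into a script. Accepting argv and convert as parameters keeps the function easy to exercise in tests without touching sys.argv or the real converter.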

Why "Vibecode" it?

Could I have found a tool to do this? Probably.
Would it handle the specifics of <subtitle> tags nested in <section> blocks exactly how I wanted? No.

By spending an hour writing this, I now have a tool that is:

  1. Fast: No uploads, no ads.
  2. Private: My books stay on my machine.
  3. Correct: The output is formatted exactly for my LLM's consumption.

In the era of AI, the ability to quickly whip up data transformation scripts is a superpower. Don't be afraid to reinvent the wheel if the tire on the existing wheel is flat.

Happy coding!
