Rost

Posted on Oct 24

HTML Preprocessing for LLMs

#html #python #programming

Converting HTML to Markdown is a fundamental task in modern development workflows, particularly when preparing web content for Large Language Models (LLMs), documentation systems, or static site generators like Hugo.

While HTML is designed for web browsers with rich styling and structure, Markdown offers a clean, readable format that's ideal for text processing, version control, and AI consumption. If you're new to Markdown syntax, check out our Markdown Cheatsheet for a comprehensive reference.

In this comprehensive review, we'll explore six Python packages for HTML-to-Markdown conversion, providing practical code examples, performance benchmarks, and real-world use cases. Whether you're building an LLM training pipeline, migrating a blog to Hugo, or scraping documentation, you'll find the perfect tool for your workflow.

Alternative Approach: If you need more intelligent content extraction with semantic understanding, you might also consider converting HTML to Markdown using LLM and Ollama, which offers AI-powered conversion for complex layouts.

What you'll learn:

Detailed comparison of 6 libraries with pros/cons for each
Performance benchmarks with real-world HTML samples
Production-ready code examples for common use cases
Best practices for LLM preprocessing workflows
Specific recommendations based on your requirements

Why Markdown for LLM Preprocessing?

Before diving into the tools, let's understand why Markdown is particularly valuable for LLM workflows:

Token Efficiency: Markdown uses significantly fewer tokens than HTML for the same content
Semantic Clarity: Markdown preserves document structure without verbose tags
Readability: Both humans and LLMs can easily parse Markdown's syntax
Consistency: Standardized format reduces ambiguity in model inputs
Storage: Smaller file sizes for training data and context windows
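To make the token-efficiency point concrete, here is a minimal sketch that compares the same content as HTML and as Markdown. The `rough_token_count` helper is a hypothetical stand-in, not a real LLM tokenizer: it counts words and punctuation marks, which is only a rough proxy. Both sample strings are invented for illustration.

```python
import re

def rough_token_count(text):
    """Rough proxy for token count: words plus individual punctuation marks.

    This is NOT a real LLM tokenizer; use your model's tokenizer for real numbers.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))

# The same content, once as HTML and once as Markdown (illustrative samples)
html = (
    '<div class="post"><h2 class="title">Install</h2>'
    '<p>Run <code>pip install foo</code> and see the '
    '<a href="https://example.com/docs">docs</a>.</p></div>'
)
markdown = "## Install\n\nRun `pip install foo` and see the [docs](https://example.com/docs).\n"

html_tokens = rough_token_count(html)
md_tokens = rough_token_count(markdown)
print(f"HTML: {html_tokens} rough tokens, Markdown: {md_tokens} rough tokens")
```

The exact ratio depends on the tokenizer and on how much markup the page carries, so measure on your own documents before relying on a specific saving.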

Markdown's versatility extends beyond HTML conversion—you can also convert Word Documents to Markdown for documentation workflows, or use it in knowledge management systems like Obsidian for Personal Knowledge Management.

TL;DR - Quick Comparison Matrix

If you're in a hurry, here's a comprehensive comparison of all six libraries at a glance. This table will help you quickly identify which tool matches your specific requirements:

Feature	html2text	markdownify	html-to-markdown	trafilatura	domscribe	html2md
HTML5 Support	Partial	Partial	Full	Full	Full	Full
Type Hints	No	No	Yes	Partial	No	Partial
Custom Handlers	Limited	Excellent	Good	Limited	Good	Limited
Table Support	Basic	Basic	Advanced	Good	Good	Good
Async Support	No	No	No	No	No	Yes
Content Extraction	No	No	No	Excellent	No	Good
Metadata Extraction	No	No	Yes	Excellent	No	Yes
CLI Tool	No	No	Yes	Yes	No	Yes
Speed	Medium	Slow	Fast	Very Fast	Medium	Very Fast
Active Development	No	Yes	Yes	Yes	Limited	Yes
Python Version	3.6+	3.7+	3.9+	3.6+	3.8+	3.10+
Dependencies	None	BS4	lxml	lxml	BS4	aiohttp

Quick Selection Guide:

Need speed? → trafilatura or html2md
Need customization? → markdownify
Need type safety? → html-to-markdown
Need simplicity? → html2text
Need content extraction? → trafilatura

Recommendations by Scenario

Still unsure which library to choose?
Here is a guide based on specific use cases.

For Web Scraping & LLM Preprocessing

Winner: trafilatura

Trafilatura excels at extracting clean content while removing boilerplate. Perfect for:

Building LLM training datasets
Content aggregation
Research paper collection
News article extraction

For Hugo/Jekyll Migrations

Winner: html2md

Async processing and frontmatter generation make bulk migrations fast and easy:

Batch conversions
Automatic metadata extraction
YAML frontmatter generation
Heading level adjustment
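The last two items in this list are easy to picture with a small, library-independent sketch. The code below is not html2md's API; `shift_headings` and `add_frontmatter` are hypothetical helpers that show what heading level adjustment and YAML frontmatter generation involve, assuming ATX-style (`#`) headings and JSON-style quoting, which YAML parsers generally accept.

```python
import json
import re
from datetime import date

def shift_headings(markdown, levels=1, max_level=6):
    """Demote ATX headings by `levels`, leaving fenced code blocks untouched."""
    out = []
    in_fence = False
    for line in markdown.split('\n'):
        if line.lstrip().startswith('```'):
            in_fence = not in_fence
        match = None if in_fence else re.match(r'^(#{1,6})(\s+.*)$', line)
        if match:
            depth = min(len(match.group(1)) + levels, max_level)
            line = '#' * depth + match.group(2)
        out.append(line)
    return '\n'.join(out)

def add_frontmatter(markdown, title, published, tags=()):
    """Prepend a minimal YAML frontmatter block (field names are assumptions)."""
    lines = [
        '---',
        f'title: {json.dumps(title, ensure_ascii=False)}',
        f'date: {published.isoformat()}',
        f'tags: {json.dumps(list(tags), ensure_ascii=False)}',
        '---',
    ]
    return '\n'.join(lines) + '\n\n' + markdown

# Placeholder title, date, and tags for illustration
body = shift_headings('# Old Post Title\n\nSome text.\n')
print(add_frontmatter(body, 'Old Post Title', date(2024, 1, 15), ['migration']))
```

Demoting headings by one level is useful when the static site generator renders the frontmatter title as the page's top-level heading.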

For Custom Conversion Logic

Winner: markdownify

Subclass the converter for complete control:

Custom tag handlers
Domain-specific conversions
Special formatting requirements
Integration with existing BeautifulSoup code

For Type-Safe Production Systems

Winner: html-to-markdown

Modern, type-safe, and feature-complete:

Full HTML5 support
Comprehensive type hints
Advanced table handling
Active maintenance

For Simple, Stable Conversions

Winner: html2text

When you need something that "just works":

No dependencies
Battle-tested
Extensive configuration
Wide platform support

Best Practices for LLM Preprocessing

Regardless of which library you choose, following these best practices will ensure high-quality Markdown output that's optimized for LLM consumption. These patterns have proven essential in production workflows processing millions of documents.

1. Clean Before Converting

Always remove unwanted elements before conversion to get cleaner output and better performance:

from bs4 import BeautifulSoup
import trafilatura

def clean_and_convert(html):
    """Remove unwanted elements before conversion"""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove unwanted elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        element.decompose()

    # Remove ads and tracking
    for element in soup.find_all(class_=['ad', 'advertisement', 'tracking']):
        element.decompose()

    # Convert cleaned HTML
    markdown = trafilatura.extract(
        str(soup),
        output_format='markdown'
    )

    return markdown

2. Normalize Whitespace

Different converters handle whitespace differently. Normalize the output to ensure consistency across your corpus:

import re

def normalize_markdown(markdown):
    """Clean up markdown spacing"""
    # Remove multiple blank lines
    markdown = re.sub(r'\n{3,}', '\n\n', markdown)

    # Remove trailing whitespace
    markdown = '\n'.join(line.rstrip() for line in markdown.split('\n'))

    # Ensure single newline at end
    markdown = markdown.rstrip() + '\n'

    return markdown

3. Validate Output

Quality control is essential. Implement validation to catch conversion errors early:

def validate_markdown(markdown):
    """Validate markdown quality"""
    issues = []

    # Check for HTML remnants
    if '<' in markdown and '>' in markdown:
        issues.append("HTML tags detected")

    # Check for links with an empty target, e.g. [text]()
    if '[' in markdown and ']()' in markdown:
        issues.append("Empty link detected")

    # Check for unclosed fenced code blocks (``` fences must come in pairs)
    code_fence_count = markdown.count('```')
    if code_fence_count % 2 != 0:
        issues.append("Unclosed code block")

    return len(issues) == 0, issues
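The `'<' in markdown and '>' in markdown` test is cheap, but it also fires on text such as `3 < 5 > 2` and on code samples that legitimately contain HTML. If that produces too many false positives in your corpus, a stricter variant could look like the sketch below. `find_html_remnants` and its regular expressions are hypothetical, and they assume backtick-delimited code spans and fences.

```python
import re

FENCED_BLOCK = re.compile(r'```.*?```', re.DOTALL)
INLINE_CODE = re.compile(r'`[^`\n]*`')
# Matches tags such as <div>, </p>, <br/>, <a href="...">, but not "3 < 5 > 2"
HTML_TAG = re.compile(r'</?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^<>]*)?/?>')

def find_html_remnants(markdown):
    """Return HTML tags found outside inline code and fenced code blocks."""
    prose = FENCED_BLOCK.sub('', markdown)
    prose = INLINE_CODE.sub('', prose)
    return [match.group(0) for match in HTML_TAG.finditer(prose)]

sample = (
    'Use `<div>` for layout.\n\n'
    '```html\n<p>example</p>\n```\n\n'
    'Leftover <span class="x">text</span> and 3 < 5 > 2.\n'
)
print(find_html_remnants(sample))
```

Returning the offending tags, rather than a boolean, makes the warning logs easier to act on.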

4. Batch Processing Template

When processing large document collections, use this production-ready template with proper error handling, logging, and parallel processing:

from pathlib import Path
from concurrent.futures import ProcessPoolExecutor
import trafilatura
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def process_file(html_path):
    """Process single HTML file"""
    try:
        html = Path(html_path).read_text(encoding='utf-8')
        markdown = trafilatura.extract(
            html,
            output_format='markdown',
            include_links=True,
            include_images=False
        )

        if markdown:
            # Normalize (normalize_markdown is defined in section 2 above)
            markdown = normalize_markdown(markdown)

            # Validate (validate_markdown is defined in section 3 above)
            is_valid, issues = validate_markdown(markdown)
            if not is_valid:
                logger.warning(f"{html_path}: {', '.join(issues)}")

            # Save next to the source file, swapping only the file extension
            output_path = Path(html_path).with_suffix('.md')
            output_path.write_text(markdown, encoding='utf-8')

            return True

        return False

    except Exception as e:
        logger.error(f"Error processing {html_path}: {e}")
        return False

def batch_convert(input_dir, max_workers=4):
    """Convert all HTML files in directory"""
    html_files = list(Path(input_dir).rglob('*.html'))
    logger.info(f"Found {len(html_files)} HTML files")

    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_file, html_files))

    success_count = sum(results)
    logger.info(f"Successfully converted {success_count}/{len(html_files)} files")

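# Usage (the __main__ guard is needed when worker processes are spawned,
# as on Windows and macOS, so the module is not re-executed in each worker)
if __name__ == '__main__':
    batch_convert('./html_docs', max_workers=8)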

Conclusion

The Python ecosystem offers mature, production-ready tools for HTML-to-Markdown conversion, each optimized for different scenarios.
Your choice should align with your specific requirements:

Quick conversions: Use html2text for its simplicity and zero dependencies
Custom logic: Use markdownify for maximum flexibility through subclassing
Web scraping: Use trafilatura for intelligent content extraction with boilerplate removal
Bulk migrations: Use html2md for async performance on large-scale projects
Production systems: Use html-to-markdown for type safety and comprehensive HTML5 support
Semantic preservation: Use domscribe for maintaining HTML5 semantic structure

Recommendations for LLM Workflows

For LLM preprocessing workflows, we recommend a two-tier approach:

Start with trafilatura for initial content extraction—it intelligently removes navigation, ads, and boilerplate while preserving the main content
Fall back to html-to-markdown for complex documents requiring precise structure preservation, such as technical documentation with tables and code blocks

By our estimate, this combination handles about 95% of real-world scenarios effectively; verify that figure against your own documents.
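The two-tier idea can be sketched as a small wrapper that takes both converters as plain functions. Everything here is an assumption made for illustration: the `convert_with_fallback` name, the `min_chars` threshold, and the stub converters. In practice, `primary` could wrap the `trafilatura.extract(html, output_format='markdown')` call shown earlier, and `fallback` could wrap whichever conversion function your installed html-to-markdown version exposes.

```python
def convert_with_fallback(html, primary, fallback, min_chars=200):
    """Try `primary`; use `fallback` when it fails or returns too little text.

    Returns a (markdown, source) tuple, where source is 'primary' or 'fallback'.
    """
    try:
        result = primary(html)
    except Exception:
        result = None  # treat a crash like an empty extraction

    if result and len(result.strip()) >= min_chars:
        return result, 'primary'
    return fallback(html), 'fallback'

# Stub converters for illustration only
def flaky_extractor(html):
    return None  # simulates an extractor that found no main content

def plain_converter(html):
    return 'converted by fallback'

text, used = convert_with_fallback('<p>Hello</p>', flaky_extractor, plain_converter)
print(used)  # prints "fallback"
```

Recording which tier produced each document makes it easy to measure how often the fallback is actually needed on your data.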

Next Steps

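Most of these tools are actively maintained; the exceptions in the comparison table are html2text, which is stable but no longer actively developed, and domscribe, which sees only limited development. Before committing to one: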

Install 2-3 libraries that match your use case
Test them with your actual HTML samples
Benchmark performance with your typical document sizes
Choose based on output quality, not just speed
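For the benchmarking step, a small harness along these lines may be enough. It is a sketch: `benchmark` is a hypothetical helper, and the two stand-in converters exist only so the example runs without third-party packages. Replace them with the library calls you are evaluating and with HTML samples from your own corpus.

```python
import re
import statistics
import time

def benchmark(converters, samples, repeats=5):
    """Return the median wall-clock seconds each converter needs per pass."""
    results = {}
    for name, convert in converters.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            for html in samples:
                convert(html)
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)
    return results

# Stand-in converters; swap in the real libraries under test
converters = {
    'strip_tags': lambda html: re.sub(r'<[^>]+>', '', html),
    'identity': lambda html: html,
}
samples = ['<h1>Title</h1><p>Body text</p>' * 100]

timings = benchmark(converters, samples)
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{name}: {seconds * 1000:.3f} ms per pass')
```

Timing alone says nothing about output quality, so inspect a sample of the converted Markdown alongside the numbers.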

The Python ecosystem for HTML-to-Markdown conversion has matured significantly, and you can't go wrong with any of these choices for their intended use cases.

DEV Community