Alain Airom (Ayrom)

Posted on Apr 26

Markdown Everything: My New Personal Project: HTML/URL to Markdown Converter

#bob #ibmbob #documentconverter #html2markdown

Simplifying Documentation using IBM Bob to Create My New Personal Project: HTML/URL to Markdown Converter!

Introduction (and spoiler)

I know, I know — take a deep breath and try to remain calm. This project features exactly zero autonomous AI agents, no sophisticated RAG pipelines, and — brace yourselves — not a single “groundbreaking” LLM integration to disrupt the industry. I’m truly sorry to disappoint the hype-train, but as often happens when I’m left to my own devices, I had a burning need for a tool that actually does something simple. So, instead of building a sentient toaster, I decided to create my own URL-to-Markdown converter for Chrome. It’s just a humble extension designed to solve a real problem, though, in a classic display of “do as I say, not as I do,” I naturally just sat back and made Bob do all the actual manual labor to realize it for me. Simple, right?

The Implementation

The application is structured as a modular utility that can handle both local files and live web URLs, converting messy HTML into clean, standardized Markdown. The core logic relies on the Turndown library, enhanced by custom rules for GitHub-Flavored Markdown (GFM).

Core Conversion Engine

At the heart of every script lies the TurndownService. The logic follows a consistent pipeline:

Initialization: Configures headingStyle (ATX vs Setext) and codeBlockStyle (fenced).
Plugin Integration: Uses turndown-plugin-gfm to ensure tables and task lists are preserved.
Custom Rules: Specifically targets elements like <pre> tags to ensure syntax highlighting is maintained in the output.

#!/usr/bin/env node

/**
 * Node.js CLI script to convert HTML files to Markdown
 * Usage: node scripts/convert-file.js <input-file> [output-file]
 */

const fs = require('fs');
const path = require('path');

// Load the converter
const HtmlToMarkdown = require('../src/html-to-markdown.js');

// Verify JSDOM is available for Node.js
try {
  require('jsdom');
} catch (e) {
  console.error('Error: JSDOM is required for Node.js usage.');
  console.error('Install it with: npm install jsdom');
  process.exit(1);
}

// Parse command line arguments
const args = process.argv.slice(2);

if (args.length === 0 || args.includes('--help') || args.includes('-h')) {
  console.log(`
HTML to Markdown Converter - CLI Tool

Usage:
  node scripts/convert-file.js <input-file> [output-file]
  node scripts/convert-file.js --help

Arguments:
  input-file    Path to the HTML file to convert
  output-file   (Optional) Path for the output Markdown file
                If not provided, will create a timestamped file in output/

Options:
  --help, -h    Show this help message

Examples:
  node scripts/convert-file.js input/sample.html
  node scripts/convert-file.js input/sample.html output/result.md
  node scripts/convert-file.js page.html
  `);
  process.exit(0);
}

const inputFile = args[0];
let outputFile = args[1];

// Check if input file exists
if (!fs.existsSync(inputFile)) {
  console.error(`Error: Input file '${inputFile}' not found.`);
  process.exit(1);
}

// Read the HTML file
console.log(`Reading HTML from: ${inputFile}`);
const html = fs.readFileSync(inputFile, 'utf8');

// Create converter instance
const converter = new HtmlToMarkdown({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced',
  bulletListMarker: '-',
  strongDelimiter: '**',
  emDelimiter: '_'
});

// Convert to Markdown
console.log('Converting HTML to Markdown...');
const markdown = converter.convert(html);

// Generate output filename if not provided
if (!outputFile) {
  const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
  const inputBasename = path.basename(inputFile, path.extname(inputFile));
  outputFile = path.join('output', `${inputBasename}-${timestamp}.md`);

  // Ensure output directory exists
  const outputDir = path.dirname(outputFile);
  if (!fs.existsSync(outputDir)) {
    fs.mkdirSync(outputDir, { recursive: true });
  }
}

// Write the Markdown file
fs.writeFileSync(outputFile, markdown, 'utf8');

console.log(`✓ Conversion successful!`);
console.log(`Output saved to: ${outputFile}`);
console.log(`File size: ${markdown.length} characters`);

// Made with Bob

Data Acquisition Logic

The application handles three distinct entry points:

| Entry Point     | Logic Flow                                                   |
| --------------- | ------------------------------------------------------------ |
| **Local File**  | Reads raw HTML from the filesystem using `fs.readFileSync`.  |
| **Simple URL**  | Fetches remote HTML via `axios`, then parses the DOM.        |
| **Complex URL** | Uses **JSDOM** to simulate a browser environment, allowing for cleaner extraction of titles and metadata before conversion. |

Image Handling & Asset Management

As detailed in Image-Handling.md, the tool doesn't just convert text; it manages media context:

Absolute URL Resolution: Logic converts relative paths (/img/photo.png) into absolute URLs based on the source domain.
Alt-Text Preservation: Ensures accessibility by mapping alt attributes to Markdown image syntax: ![alt](url).

#!/usr/bin/env node

/**
 * Node.js CLI script to convert HTML from a URL to Markdown
 * Usage: node scripts/convert-url.js <url> [output-file]
 */

const fs = require('fs');
const path = require('path');
const https = require('https');
const http = require('http');

// Load the converter
const HtmlToMarkdown = require('../src/html-to-markdown.js');

// Verify JSDOM is available
try {
  require('jsdom');
} catch (e) {
  console.error('Error: JSDOM is required for Node.js usage.');
  console.error('Install it with: npm install jsdom');
  process.exit(1);
}

// Parse command line arguments
const args = process.argv.slice(2);

if (args.length === 0 || args.includes('--help') || args.includes('-h')) {
  console.log(`
HTML to Markdown Converter - URL Fetcher

Usage:
  node scripts/convert-url.js <url> [output-file]
  node scripts/convert-url.js --help

Arguments:
  url           URL of the webpage to convert
  output-file   (Optional) Path for the output Markdown file
                If not provided, will create a timestamped file in output/

Options:
  --help, -h    Show this help message

Examples:
  node scripts/convert-url.js https://example.com
  node scripts/convert-url.js https://example.com output/example.md
  node scripts/convert-url.js https://github.com/user/repo/blob/main/README.md
  `);
  process.exit(0);
}

const url = args[0];
let outputFile = args[1];

// Validate URL
try {
  new URL(url);
} catch (e) {
  console.error(`Error: Invalid URL '${url}'`);
  console.error('Please provide a valid URL starting with http:// or https://');
  process.exit(1);
}

/**
 * Fetch HTML content from URL
 */
function fetchUrl(url) {
  return new Promise((resolve, reject) => {
    const protocol = url.startsWith('https') ? https : http;

    console.log(`Fetching content from: ${url}`);

    const request = protocol.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; HTML-to-Markdown-Converter/1.0)'
      }
    }, (response) => {
      // Handle redirects
      if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) {
        console.log(`Following redirect to: ${response.headers.location}`);
        fetchUrl(response.headers.location).then(resolve).catch(reject);
        return;
      }

      if (response.statusCode !== 200) {
        reject(new Error(`HTTP ${response.statusCode}: ${response.statusMessage}`));
        return;
      }

      let data = '';
      response.on('data', chunk => data += chunk);
      response.on('end', () => resolve(data));
    });

    request.on('error', reject);
    request.setTimeout(30000, () => {
      request.destroy();
      reject(new Error('Request timeout after 30 seconds'));
    });
  });
}

/**
 * Extract domain name from URL for filename
 */
function getDomainName(url) {
  try {
    const urlObj = new URL(url);
    return urlObj.hostname.replace(/^www\./, '');
  } catch (e) {
    return 'webpage';
  }
}

// Main execution
(async () => {
  try {
    // Fetch HTML from URL
    const html = await fetchUrl(url);

    if (!html || html.trim().length === 0) {
      console.error('Error: No content received from URL');
      process.exit(1);
    }

    console.log(`✓ Content fetched (${html.length} characters)`);

    // Create converter instance
    const converter = new HtmlToMarkdown({
      headingStyle: 'atx',
      codeBlockStyle: 'fenced',
      bulletListMarker: '-',
      strongDelimiter: '**',
      emDelimiter: '_'
    });

    // Convert to Markdown
    console.log('Converting HTML to Markdown...');
    const markdown = converter.convert(html);

    // Generate output filename if not provided
    if (!outputFile) {
      const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
      const domain = getDomainName(url);
      outputFile = path.join('output', `${domain}-${timestamp}.md`);

      // Ensure output directory exists
      const outputDir = path.dirname(outputFile);
      if (!fs.existsSync(outputDir)) {
        fs.mkdirSync(outputDir, { recursive: true });
      }
    }

    // Write the Markdown file
    fs.writeFileSync(outputFile, markdown, 'utf8');

    console.log(`✓ Conversion successful!`);
    console.log(`Output saved to: ${outputFile}`);
    console.log(`File size: ${markdown.length} characters`);
    console.log(`\nPreview (first 500 characters):`);
    console.log('─'.repeat(60));
    console.log(markdown.substring(0, 500) + (markdown.length > 500 ? '...' : ''));
    console.log('─'.repeat(60));

  } catch (error) {
    console.error(`\n✗ Error: ${error.message}`);

    if (error.code === 'ENOTFOUND') {
      console.error('Could not resolve hostname. Please check the URL and your internet connection.');
    } else if (error.code === 'ECONNREFUSED') {
      console.error('Connection refused. The server may be down or unreachable.');
    } else if (error.message.includes('timeout')) {
      console.error('Request timed out. The server may be slow or unresponsive.');
    }

    process.exit(1);
  }
})();

// Made with Bob

Application Workflow

The following logic flow governs the transition from a URL to a downloaded Markdown file:

Request Phase: User provides a URL or file path.
Extraction Phase: The system fetches the content: it isolates the main content body (often stripping headers/footers if logic is applied).
Transformation Phase: HTML tags are mapped to Markdown equivalents (e.g.,
becomes #) : code blocks are wrapped in triple backticks.
Finalization: Metadata (Title, Date, Source URL) is prepended as Front Matter: the resulting string is saved as a .md file.

Last but not Least: Extension for a Browser

The extension follows the standard Chrome “V3” architecture, distributed across three main functional areas:

The Manifest (Orchestrator)

File: manifest.json
Role: Defines the permissions and entry points. It specifically requests activeTab access to read the page content and downloads permissions to save the final .md file to your computer.

The Popup (User Interface)

Files: popup.html, popup.js

/**
 * Chrome Extension Popup Script
 * Handles user interactions and conversion logic
 */

let currentMarkdown = '';

// Initialize when popup opens
document.addEventListener('DOMContentLoaded', () => {
  const convertBtn = document.getElementById('convert-btn');
  const copyBtn = document.getElementById('copy-btn');
  const downloadBtn = document.getElementById('download-btn');
  const markdownOutput = document.getElementById('markdown-output');
  const headingStyle = document.getElementById('heading-style');
  const codeStyle = document.getElementById('code-style');

  // Convert button click
  convertBtn.addEventListener('click', async () => {
    try {
      showStatus('Converting...', 'info');
      convertBtn.disabled = true;

      // Get the active tab
      const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });

      // Execute script in the page to get HTML
      const results = await chrome.scripting.executeScript({
        target: { tabId: tab.id },
        function: getPageHTML
      });

      if (results && results[0] && results[0].result) {
        const html = results[0].result;

        // Convert HTML to Markdown
        const converter = new HtmlToMarkdown({
          headingStyle: headingStyle.value,
          codeBlockStyle: codeStyle.value
        });

        currentMarkdown = converter.convert(html);
        markdownOutput.value = currentMarkdown;

        // Enable buttons
        copyBtn.disabled = false;
        downloadBtn.disabled = false;

        showStatus('✓ Conversion successful!', 'success');
      } else {
        throw new Error('Could not retrieve page content');
      }
    } catch (error) {
      console.error('Conversion error:', error);
      showStatus('✗ Error: ' + error.message, 'error');
    } finally {
      convertBtn.disabled = false;
    }
  });

  // Copy button click
  copyBtn.addEventListener('click', async () => {
    try {
      await navigator.clipboard.writeText(currentMarkdown);
      showStatus('✓ Copied to clipboard!', 'success');

      // Visual feedback
      const originalText = copyBtn.textContent;
      copyBtn.textContent = '✓ Copied!';
      setTimeout(() => {
        copyBtn.textContent = originalText;
      }, 2000);
    } catch (error) {
      console.error('Copy error:', error);
      showStatus('✗ Failed to copy', 'error');
    }
  });

  // Download button click
  downloadBtn.addEventListener('click', () => {
    try {
      const blob = new Blob([currentMarkdown], { type: 'text/markdown' });
      const url = URL.createObjectURL(blob);
      const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
      const filename = `converted-${timestamp}.md`;

      const a = document.createElement('a');
      a.href = url;
      a.download = filename;
      document.body.appendChild(a);
      a.click();
      document.body.removeChild(a);
      URL.revokeObjectURL(url);

      showStatus('✓ Downloaded!', 'success');
    } catch (error) {
      console.error('Download error:', error);
      showStatus('✗ Failed to download', 'error');
    }
  });

  // Update conversion when options change
  headingStyle.addEventListener('change', () => {
    if (currentMarkdown) {
      convertBtn.click();
    }
  });

  codeStyle.addEventListener('change', () => {
    if (currentMarkdown) {
      convertBtn.click();
    }
  });
});

/**
 * Function to be executed in the page context
 * Gets the HTML content of the page
 */
function getPageHTML() {
  // Get the main content, preferring article or main tags
  const article = document.querySelector('article');
  const main = document.querySelector('main');
  const body = document.body;

  // Return the most relevant content
  if (article) {
    return article.outerHTML;
  } else if (main) {
    return main.outerHTML;
  } else {
    return body.innerHTML;
  }
}

/**
 * Show status message
 */
function showStatus(message, type) {
  const status = document.getElementById('status');
  status.textContent = message;
  status.className = `status ${type}`;
  status.style.display = 'block';

  // Auto-hide after 3 seconds for success messages
  if (type === 'success') {
    setTimeout(() => {
      status.style.display = 'none';
    }, 3000);
  }
}

// Made with Bob

Logic: This is the “control center.”
Configuration: It allows users to toggle settings like “Heading Style” (ATX vs. Setext) and “Code Block Style” (Fenced vs. Indented) directly from a dropdown.
Communication: When the “Convert” button is clicked, popup.js sends a message to the content script to grab the page data.
Trigger: Once the Markdown is returned, it creates a Blob and triggers a download via the Chrome Downloads API.

The Content Script (The “Brain”)

File: content.js
Logic: This script lives inside the webpage you are viewing.
Extraction: It scrapes the current DOM, targeting the document.body.innerHTML.
Transformation: It bundles the Turndown library logic to convert that HTML string into Markdown on the fly.
Metadata: It automatically grabs the page <title> and URL to create a header for your document.

/**
 * Content Script
 * Runs in the context of web pages
 * Can be used for additional features like context menu conversion
 */

// Listen for messages from the extension
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'getHTML') {
    // Get the HTML content
    const html = document.documentElement.outerHTML;
    sendResponse({ html });
  }
  return true;
});

// Optional: Add keyboard shortcut for quick conversion
document.addEventListener('keydown', (e) => {
  // Ctrl+Shift+M or Cmd+Shift+M to trigger conversion
  if ((e.ctrlKey || e.metaKey) && e.shiftKey && e.key === 'M') {
    e.preventDefault();
    // Open the extension popup programmatically
    chrome.runtime.sendMessage({ action: 'openPopup' });
  }
});

console.log('HTML to Markdown extension loaded');

// Made with Bob

Technical Stack Summary

Runtime: Node.js
Parsing: jsdom (for DOM manipulation)
Conversion: turndown + turndown-plugin-gfm
HTTP Client: axios
CLI Assets: create-icons.sh (for generating Chrome Extension icons)

Note from Bob: The logic is designed to be “plug-and-play.” Whether you are running a script in the terminal or clicking the extension button, the underlying conversion rules remain identical to ensure consistent output quality.

Conclusion

To wrap this all up, there is a distinct, borderline-obsessive satisfaction in using a tool where you know every single line of the source code. Sure, there are dozens of converters out there, but this one is my precious code, and that makes it inherently superior.

The real kicker? The entire transition from “I have a burning need” to a fully functional Chrome extension and CLI suite happened in less than 30 minutes. By offloading the heavy lifting to Bob — who, per my strict instructions, ensured everything was backed by unit tests — the development cycle was essentially at warp speed.

>>> Thanks for reading <<<

DEV Community