Simplifying Documentation using IBM Bob to Create My New Personal Project: HTML/URL to Markdown Converter!
Introduction (and spoiler)
I know, I know — take a deep breath and try to remain calm. This project features exactly zero autonomous AI agents, no sophisticated RAG pipelines, and — brace yourselves — not a single “groundbreaking” LLM integration to disrupt the industry. I’m truly sorry to disappoint the hype-train, but as often happens when I’m left to my own devices, I had a burning need for a tool that actually does something simple. So, instead of building a sentient toaster, I decided to create my own URL-to-Markdown converter for Chrome. It’s just a humble extension designed to solve a real problem, though, in a classic display of “do as I say, not as I do,” I naturally just sat back and made Bob do all the actual manual labor to realize it for me. Simple, right?
The Implementation
The application is structured as a modular utility that can handle both local files and live web URLs, converting messy HTML into clean, standardized Markdown. The core logic relies on the Turndown library, enhanced by custom rules for GitHub-Flavored Markdown (GFM).
Core Conversion Engine
At the heart of every script lies the TurndownService. The logic follows a consistent pipeline:
- Initialization: Configures `headingStyle` (ATX vs Setext) and `codeBlockStyle` (fenced).
- Plugin Integration: Uses `turndown-plugin-gfm` to ensure tables and task lists are preserved.
- Custom Rules: Specifically targets elements like `<pre>` tags to ensure syntax highlighting is maintained in the output.
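The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `html-to-markdown.js`: the function name `buildConverter` and the rule name `fencedCodeWithLanguage` are hypothetical, and the Turndown constructor and GFM plugin are passed in as arguments so the sketch has no install-time dependency. It also assumes code blocks are marked up as `<pre><code class="language-xyz">`.

```javascript
// Hypothetical helper: builds a configured Turndown instance.
// TurndownService is the class exported by the `turndown` package and
// gfm is the plugin exported by `turndown-plugin-gfm`; both are injected.
function buildConverter(TurndownService, gfm) {
  // 1. Initialization: ATX headings (#) and fenced code blocks
  const service = new TurndownService({
    headingStyle: 'atx',
    codeBlockStyle: 'fenced'
  });

  // 2. Plugin integration: tables, task lists, strikethrough
  service.use(gfm);

  // 3. Custom rule: keep the language hint of <pre><code class="language-js">
  service.addRule('fencedCodeWithLanguage', {
    filter: (node) =>
      node.nodeName === 'PRE' &&
      node.firstChild !== null &&
      node.firstChild.nodeName === 'CODE',
    replacement: (content, node) => {
      const code = node.firstChild;
      const className = code.getAttribute('class') || '';
      const match = className.match(/language-(\S+)/);
      const language = match ? match[1] : '';
      const fence = '`'.repeat(3);
      const body = code.textContent.replace(/\n$/, '');
      return `\n\n${fence}${language}\n${body}\n${fence}\n\n`;
    }
  });

  return service;
}
```

With the real packages installed, usage would look like `buildConverter(require('turndown'), require('turndown-plugin-gfm').gfm).turndown(html)`.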
#!/usr/bin/env node
/**
 * Node.js CLI script to convert HTML files to Markdown
 * Usage: node scripts/convert-file.js <input-file> [output-file]
 */
const fs = require('fs');
const path = require('path');

// Load the converter
const HtmlToMarkdown = require('../src/html-to-markdown.js');

// Verify JSDOM is available for Node.js
try {
  require('jsdom');
} catch (e) {
  console.error('Error: JSDOM is required for Node.js usage.');
  console.error('Install it with: npm install jsdom');
  process.exit(1);
}

// Parse command line arguments
const args = process.argv.slice(2);
if (args.length === 0 || args.includes('--help') || args.includes('-h')) {
  console.log(`
HTML to Markdown Converter - CLI Tool

Usage:
  node scripts/convert-file.js <input-file> [output-file]
  node scripts/convert-file.js --help

Arguments:
  input-file    Path to the HTML file to convert
  output-file   (Optional) Path for the output Markdown file
                If not provided, will create a timestamped file in output/

Options:
  --help, -h    Show this help message

Examples:
  node scripts/convert-file.js input/sample.html
  node scripts/convert-file.js input/sample.html output/result.md
  node scripts/convert-file.js page.html
`);
  process.exit(0);
}

const inputFile = args[0];
let outputFile = args[1];

// Check if input file exists
if (!fs.existsSync(inputFile)) {
  console.error(`Error: Input file '${inputFile}' not found.`);
  process.exit(1);
}

// Read the HTML file
console.log(`Reading HTML from: ${inputFile}`);
const html = fs.readFileSync(inputFile, 'utf8');

// Create converter instance
const converter = new HtmlToMarkdown({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced',
  bulletListMarker: '-',
  strongDelimiter: '**',
  emDelimiter: '_'
});

// Convert to Markdown
console.log('Converting HTML to Markdown...');
const markdown = converter.convert(html);

// Generate output filename if not provided
if (!outputFile) {
  const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
  const inputBasename = path.basename(inputFile, path.extname(inputFile));
  outputFile = path.join('output', `${inputBasename}-${timestamp}.md`);
}

// Ensure the output directory exists, including for a user-supplied path
fs.mkdirSync(path.dirname(outputFile), { recursive: true });

// Write the Markdown file
fs.writeFileSync(outputFile, markdown, 'utf8');
console.log(`✓ Conversion successful!`);
console.log(`Output saved to: ${outputFile}`);
console.log(`File size: ${markdown.length} characters`);
// Made with Bob
Data Acquisition Logic
The application handles three distinct entry points:
| Entry Point | Logic Flow |
| --------------- | ------------------------------------------------------------ |
| **Local File** | Reads raw HTML from the filesystem using `fs.readFileSync`. |
| **Simple URL** | Fetches remote HTML via `axios`, then parses the DOM. |
| **Complex URL** | Uses **JSDOM** to simulate a browser environment, allowing for cleaner extraction of titles and metadata before conversion. |
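As a rough illustration of how a single CLI argument could be routed to one of these entry points, here is a minimal sketch. The helper name `classifySource` and its return shape are hypothetical and not taken from the repository. It only separates local paths from HTTP(S) URLs, and does not attempt to decide between the "simple" and "complex" URL paths.

```javascript
// Hypothetical helper: decide which entry point a CLI argument belongs to.
function classifySource(input) {
  try {
    const parsed = new URL(input);
    if (parsed.protocol === 'http:' || parsed.protocol === 'https:') {
      return { kind: 'url', href: parsed.href };
    }
  } catch (e) {
    // Not parseable as a URL; fall through and treat it as a path
  }
  // Anything that is not an http(s) URL is treated as a local file path
  return { kind: 'file', path: input };
}

console.log(classifySource('https://example.com'));
console.log(classifySource('input/sample.html'));
```

Using the built-in `URL` parser rather than a string prefix check means malformed input is rejected by the parser and never reaches the network layer.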
Image Handling & Asset Management
As detailed in Image-Handling.md, the tool doesn't just convert text; it manages media context:
- Absolute URL Resolution: Logic converts relative paths (`/img/photo.png`) into absolute URLs based on the source domain.
- Alt-Text Preservation: Ensures accessibility by mapping alt attributes to Markdown image syntax: `![alt text](image-url)`.
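Both behaviours can be sketched with nothing more than the built-in `URL` class. The helper name `imageToMarkdown` is hypothetical, and the real logic described in Image-Handling.md may differ. This is only meant to show the idea.

```javascript
// Hypothetical helper: turn an <img> src/alt pair into Markdown.
function imageToMarkdown(src, alt, pageUrl) {
  // new URL(relative, base) resolves relative paths against the page URL;
  // an already-absolute src is returned unchanged
  const absolute = new URL(src, pageUrl).href;
  // Escape closing brackets so the alt text cannot end the label early
  const safeAlt = (alt || '').replace(/\]/g, '\\]');
  return `![${safeAlt}](${absolute})`;
}

console.log(imageToMarkdown('/img/photo.png', 'A photo', 'https://example.com/blog/post.html'));
// prints: ![A photo](https://example.com/img/photo.png)
```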
#!/usr/bin/env node
/**
 * Node.js CLI script to convert HTML from a URL to Markdown
 * Usage: node scripts/convert-url.js <url> [output-file]
 */
const fs = require('fs');
const path = require('path');
const https = require('https');
const http = require('http');

// Load the converter
const HtmlToMarkdown = require('../src/html-to-markdown.js');

// Verify JSDOM is available
try {
  require('jsdom');
} catch (e) {
  console.error('Error: JSDOM is required for Node.js usage.');
  console.error('Install it with: npm install jsdom');
  process.exit(1);
}

// Parse command line arguments
const args = process.argv.slice(2);
if (args.length === 0 || args.includes('--help') || args.includes('-h')) {
  console.log(`
HTML to Markdown Converter - URL Fetcher

Usage:
  node scripts/convert-url.js <url> [output-file]
  node scripts/convert-url.js --help

Arguments:
  url           URL of the webpage to convert
  output-file   (Optional) Path for the output Markdown file
                If not provided, will create a timestamped file in output/

Options:
  --help, -h    Show this help message

Examples:
  node scripts/convert-url.js https://example.com
  node scripts/convert-url.js https://example.com output/example.md
  node scripts/convert-url.js https://github.com/user/repo/blob/main/README.md
`);
  process.exit(0);
}
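
const url = args[0];
let outputFile = args[1];

// Validate URL (only http and https are supported by the fetcher below)
let parsedUrl = null;
try {
  parsedUrl = new URL(url);
} catch (e) {
  // Leave parsedUrl as null so the check below reports the error
}
if (!parsedUrl || !['http:', 'https:'].includes(parsedUrl.protocol)) {
  console.error(`Error: Invalid URL '${url}'`);
  console.error('Please provide a valid URL starting with http:// or https://');
  process.exit(1);
}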

/**
 * Fetch HTML content from URL, following up to `redirectsLeft` redirects
 */
function fetchUrl(url, redirectsLeft = 5) {
  return new Promise((resolve, reject) => {
    const protocol = url.startsWith('https') ? https : http;
    console.log(`Fetching content from: ${url}`);
    const request = protocol.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; HTML-to-Markdown-Converter/1.0)'
      }
    }, (response) => {
      // Handle redirects
      if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) {
        response.resume(); // Discard the redirect body so the socket is released
        if (redirectsLeft <= 0) {
          reject(new Error('Too many redirects'));
          return;
        }
        // Location may be relative, so resolve it against the current URL
        const nextUrl = new URL(response.headers.location, url).href;
        console.log(`Following redirect to: ${nextUrl}`);
        fetchUrl(nextUrl, redirectsLeft - 1).then(resolve).catch(reject);
        return;
      }
      if (response.statusCode !== 200) {
        response.resume();
        reject(new Error(`HTTP ${response.statusCode}: ${response.statusMessage}`));
        return;
      }
      // Decode as UTF-8 so multi-byte characters split across chunks stay intact
      response.setEncoding('utf8');
      let data = '';
      response.on('data', chunk => data += chunk);
      response.on('end', () => resolve(data));
    });
    request.on('error', reject);
    request.setTimeout(30000, () => {
      request.destroy();
      reject(new Error('Request timeout after 30 seconds'));
    });
  });
}

/**
 * Extract domain name from URL for filename
 */
function getDomainName(url) {
  try {
    const urlObj = new URL(url);
    return urlObj.hostname.replace(/^www\./, '');
  } catch (e) {
    return 'webpage';
  }
}

// Main execution
(async () => {
  try {
    // Fetch HTML from URL
    const html = await fetchUrl(url);
    if (!html || html.trim().length === 0) {
      console.error('Error: No content received from URL');
      process.exit(1);
    }
    console.log(`✓ Content fetched (${html.length} characters)`);

    // Create converter instance
    const converter = new HtmlToMarkdown({
      headingStyle: 'atx',
      codeBlockStyle: 'fenced',
      bulletListMarker: '-',
      strongDelimiter: '**',
      emDelimiter: '_'
    });

    // Convert to Markdown
    console.log('Converting HTML to Markdown...');
    const markdown = converter.convert(html);

    // Generate output filename if not provided
    if (!outputFile) {
      const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
      const domain = getDomainName(url);
      outputFile = path.join('output', `${domain}-${timestamp}.md`);
    }

    // Ensure the output directory exists, including for a user-supplied path
    fs.mkdirSync(path.dirname(outputFile), { recursive: true });

    // Write the Markdown file
    fs.writeFileSync(outputFile, markdown, 'utf8');
    console.log(`✓ Conversion successful!`);
    console.log(`Output saved to: ${outputFile}`);
    console.log(`File size: ${markdown.length} characters`);
    console.log(`\nPreview (first 500 characters):`);
    console.log('─'.repeat(60));
    console.log(markdown.substring(0, 500) + (markdown.length > 500 ? '...' : ''));
    console.log('─'.repeat(60));
  } catch (error) {
    console.error(`\n✗ Error: ${error.message}`);
    if (error.code === 'ENOTFOUND') {
      console.error('Could not resolve hostname. Please check the URL and your internet connection.');
    } else if (error.code === 'ECONNREFUSED') {
      console.error('Connection refused. The server may be down or unreachable.');
    } else if (error.message.includes('timeout')) {
      console.error('Request timed out. The server may be slow or unresponsive.');
    }
    process.exit(1);
  }
})();
// Made with Bob
Application Workflow
The following logic flow governs the transition from a URL to a downloaded Markdown file:
- Request Phase: The user provides a URL or file path.
- Extraction Phase: The system fetches the content and isolates the main content body (for example, the extension prefers an `<article>` or `<main>` element over the full `<body>`).
- Transformation Phase: HTML tags are mapped to Markdown equivalents (e.g., `<h1>` becomes `#`), and code blocks are wrapped in triple backticks.
- Finalization: Metadata (Title, Date, Source URL) is prepended as Front Matter, and the resulting string is saved as a `.md` file.
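The finalization step could look something like the sketch below. The helper name `addFrontMatter` and the key names `title`, `source`, and `date` are assumptions made for illustration. The scripts shown in this post do not include this step, so treat it as one possible shape rather than the project's implementation.

```javascript
// Hypothetical helper: prepend YAML front matter to converted Markdown.
function addFrontMatter(markdown, { title, source, date }) {
  // JSON.stringify yields a double-quoted, escaped string,
  // which YAML parsers generally accept as a scalar too
  const lines = [
    '---',
    `title: ${JSON.stringify(title)}`,
    `source: ${JSON.stringify(source)}`,
    `date: ${date.toISOString().slice(0, 10)}`,
    '---',
    ''
  ];
  return lines.join('\n') + '\n' + markdown;
}

console.log(addFrontMatter('# Hello', {
  title: 'Example Domain',
  source: 'https://example.com',
  date: new Date('2024-01-02T03:04:05Z')
}));
```

Quoting the title and URL avoids surprises when a page title contains a colon or a `#`, both of which carry meaning in YAML.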
Last but not Least: Extension for a Browser
The extension follows the standard Chrome Manifest V3 architecture, distributed across three main functional areas:
The Manifest (Orchestrator)
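- File: `manifest.json`
- Role: Defines the permissions and entry points. It specifically requests `activeTab` access to read the page content and `downloads` permissions to save the final `.md` file to your computer. Because `popup.js` calls `chrome.scripting.executeScript`, the manifest also needs the `scripting` permission.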
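For orientation, here is a minimal sketch of what such a manifest might contain, written as a JavaScript object so it can be checked and serialized. The extension name, version, and match pattern are placeholders, not values from the repository. Only the field names (`manifest_version`, `permissions`, `action`, `content_scripts`) are standard Manifest V3 keys.

```javascript
// Hypothetical manifest contents; name, version, and match pattern are assumptions.
const manifest = {
  manifest_version: 3,
  name: 'HTML to Markdown',
  version: '1.0.0',
  // activeTab + scripting let the popup read the current page on click;
  // downloads allows saving the generated .md file
  permissions: ['activeTab', 'scripting', 'downloads'],
  // The popup is the UI entry point
  action: { default_popup: 'popup.html' },
  // The content script is injected into matching pages
  content_scripts: [{ matches: ['<all_urls>'], js: ['content.js'] }]
};

// Serialize to the JSON that would live in manifest.json
console.log(JSON.stringify(manifest, null, 2));
```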
The Popup (User Interface)
- Files: popup.html, popup.js
/**
 * Chrome Extension Popup Script
 * Handles user interactions and conversion logic
 */
let currentMarkdown = '';

// Initialize when popup opens
document.addEventListener('DOMContentLoaded', () => {
  const convertBtn = document.getElementById('convert-btn');
  const copyBtn = document.getElementById('copy-btn');
  const downloadBtn = document.getElementById('download-btn');
  const markdownOutput = document.getElementById('markdown-output');
  const headingStyle = document.getElementById('heading-style');
  const codeStyle = document.getElementById('code-style');

  // Convert button click
  convertBtn.addEventListener('click', async () => {
    try {
      showStatus('Converting...', 'info');
      convertBtn.disabled = true;

      // Get the active tab
      const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });

      // Execute script in the page to get HTML
      // (`func` is the current property name; `function` is a deprecated alias)
      const results = await chrome.scripting.executeScript({
        target: { tabId: tab.id },
        func: getPageHTML
      });

      if (results && results[0] && results[0].result) {
        const html = results[0].result;

        // Convert HTML to Markdown
        const converter = new HtmlToMarkdown({
          headingStyle: headingStyle.value,
          codeBlockStyle: codeStyle.value
        });
        currentMarkdown = converter.convert(html);
        markdownOutput.value = currentMarkdown;

        // Enable buttons
        copyBtn.disabled = false;
        downloadBtn.disabled = false;
        showStatus('✓ Conversion successful!', 'success');
      } else {
        throw new Error('Could not retrieve page content');
      }
    } catch (error) {
      console.error('Conversion error:', error);
      showStatus('✗ Error: ' + error.message, 'error');
    } finally {
      convertBtn.disabled = false;
    }
  });

  // Copy button click
  copyBtn.addEventListener('click', async () => {
    try {
      await navigator.clipboard.writeText(currentMarkdown);
      showStatus('✓ Copied to clipboard!', 'success');

      // Visual feedback
      const originalText = copyBtn.textContent;
      copyBtn.textContent = '✓ Copied!';
      setTimeout(() => {
        copyBtn.textContent = originalText;
      }, 2000);
    } catch (error) {
      console.error('Copy error:', error);
      showStatus('✗ Failed to copy', 'error');
    }
  });

  // Download button click
  downloadBtn.addEventListener('click', () => {
    try {
      const blob = new Blob([currentMarkdown], { type: 'text/markdown' });
      const url = URL.createObjectURL(blob);
      const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
      const filename = `converted-${timestamp}.md`;
      const a = document.createElement('a');
      a.href = url;
      a.download = filename;
      document.body.appendChild(a);
      a.click();
      document.body.removeChild(a);
      URL.revokeObjectURL(url);
      showStatus('✓ Downloaded!', 'success');
    } catch (error) {
      console.error('Download error:', error);
      showStatus('✗ Failed to download', 'error');
    }
  });

  // Update conversion when options change
  headingStyle.addEventListener('change', () => {
    if (currentMarkdown) {
      convertBtn.click();
    }
  });
  codeStyle.addEventListener('change', () => {
    if (currentMarkdown) {
      convertBtn.click();
    }
  });
});

/**
 * Function to be executed in the page context
 * Gets the HTML content of the page
 */
function getPageHTML() {
  // Get the main content, preferring article or main tags
  const article = document.querySelector('article');
  const main = document.querySelector('main');
  const body = document.body;

  // Return the most relevant content
  if (article) {
    return article.outerHTML;
  } else if (main) {
    return main.outerHTML;
  } else {
    return body.innerHTML;
  }
}

/**
 * Show status message
 */
function showStatus(message, type) {
  const status = document.getElementById('status');
  status.textContent = message;
  status.className = `status ${type}`;
  status.style.display = 'block';

  // Auto-hide after 3 seconds for success messages
  if (type === 'success') {
    setTimeout(() => {
      status.style.display = 'none';
    }, 3000);
  }
}
// Made with Bob
- Logic: This is the “control center.”
- Configuration: It allows users to toggle settings like “Heading Style” (ATX vs. Setext) and “Code Block Style” (Fenced vs. Indented) directly from a dropdown.
- Communication: When the “Convert” button is clicked, popup.js injects `getPageHTML` into the active tab with `chrome.scripting.executeScript` to grab the page data.
- Trigger: Once the Markdown is returned, the “Download” button creates a Blob and saves it through a temporary download link.
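To make the two dropdown settings concrete, here is a small sketch of what each option changes in the output. The helpers `renderHeading` and `renderCodeBlock` are hypothetical and exist only for illustration. In the extension itself the choice is handed to the converter as `headingStyle` and `codeBlockStyle`.

```javascript
// Hypothetical helper showing how ATX and Setext headings differ.
function renderHeading(text, level, style) {
  // Setext only has underline forms for levels 1 and 2;
  // deeper levels fall back to ATX
  if (style === 'setext' && level <= 2) {
    const underline = (level === 1 ? '=' : '-').repeat(text.length);
    return `${text}\n${underline}`;
  }
  return `${'#'.repeat(level)} ${text}`;
}

// Hypothetical helper showing fenced versus indented code blocks.
function renderCodeBlock(code, style) {
  if (style === 'indented') {
    // Indented blocks prefix every line with four spaces
    return code.split('\n').map((line) => `    ${line}`).join('\n');
  }
  const fence = '`'.repeat(3);
  return `${fence}\n${code}\n${fence}`;
}

console.log(renderHeading('Title', 1, 'atx'));
console.log(renderHeading('Title', 1, 'setext'));
console.log(renderCodeBlock('npm install', 'fenced'));
console.log(renderCodeBlock('npm install', 'indented'));
```

Fenced blocks are usually the safer default for technical pages because they can carry a language hint, which indented blocks cannot.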
The Content Script (The “Brain”)
- File: `content.js`
- Logic: This script lives inside the webpage you are viewing.
- Extraction: It scrapes the current DOM, returning `document.documentElement.outerHTML` when it receives a `getHTML` message.
- Transformation: It bundles the Turndown library logic to convert that HTML string into Markdown on the fly.
- Metadata: It automatically grabs the page `<title>` and URL to create a header for your document.
/**
 * Content Script
 * Runs in the context of web pages
 * Can be used for additional features like context menu conversion
 */

// Listen for messages from the extension
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.action === 'getHTML') {
    // Get the HTML content
    const html = document.documentElement.outerHTML;
    sendResponse({ html });
  }
  return true;
});

// Optional: Add keyboard shortcut for quick conversion
document.addEventListener('keydown', (e) => {
  // Ctrl+Shift+M or Cmd+Shift+M to trigger conversion
  if ((e.ctrlKey || e.metaKey) && e.shiftKey && e.key === 'M') {
    e.preventDefault();
    // Ask the extension to open the popup; this only sends a message,
    // so another part of the extension must handle 'openPopup'
    chrome.runtime.sendMessage({ action: 'openPopup' });
  }
});

console.log('HTML to Markdown extension loaded');
// Made with Bob
Technical Stack Summary
- Runtime: Node.js
- Parsing: `jsdom` (for DOM manipulation)
- Conversion: `turndown` + `turndown-plugin-gfm`
- HTTP Client: `axios`
- CLI Assets: `create-icons.sh` (for generating Chrome Extension icons)
Note from Bob: The logic is designed to be “plug-and-play.” Whether you are running a script in the terminal or clicking the extension button, the underlying conversion rules remain identical to ensure consistent output quality.
Conclusion
To wrap this all up, there is a distinct, borderline-obsessive satisfaction in using a tool where you know every single line of the source code. Sure, there are dozens of converters out there, but this one is my precious code, and that makes it inherently superior.
The real kicker? The entire transition from “I have a burning need” to a fully functional Chrome extension and CLI suite happened in less than 30 minutes. By offloading the heavy lifting to Bob — who, per my strict instructions, ensured everything was backed by unit tests — the development cycle was essentially at warp speed.
>>> Thanks for reading <<<
Links
- Code repository: https://github.com/aairom/html-to-markdown
- IBM Bob: https://bob.ibm.com/





Top comments (0)