
Jonathan Geiger

Originally published at capturekit.dev


How to Extract HTML from Web Pages with Puppeteer

Extracting HTML content from websites is a fundamental task for web scrapers, data scientists, and developers building automation tools. Puppeteer, a Node.js library developed by Google, provides a robust way to interact with web pages programmatically. In this guide, we'll explore how to extract HTML content effectively with Puppeteer and address common challenges.

What is Puppeteer?

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium browsers. It enables developers to:

  • Scrape web content and extract data
  • Automate form submissions and user interactions
  • Generate screenshots and PDFs
  • Run automated testing
  • Monitor website performance
  • Crawl single-page applications (SPAs)

Let's dive into using Puppeteer for HTML extraction.

Setting Up Puppeteer

First, install Puppeteer via npm:

npm install puppeteer

This command installs both Puppeteer and a compatible version of Chromium. If you'd prefer to use your existing Chrome installation, use puppeteer-core instead:

npm install puppeteer-core
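
Unlike the full package, puppeteer-core does not download a browser, so you must tell it where Chrome lives. A minimal sketch (the executablePath below is an assumption; adjust it for your OS and installation):

const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({
    // Hypothetical path to a local Chrome binary; varies by OS
    executablePath: '/usr/bin/google-chrome',
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();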

Basic HTML Extraction

Here's a simple script to extract the entire HTML from a webpage:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Get the page's HTML content
  const html = await page.content();
  console.log(html);

  await browser.close();
})();

This script:

  1. Launches a headless browser
  2. Opens a new page
  3. Navigates to https://example.com
  4. Extracts the full HTML content
  5. Closes the browser
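
One caveat: if goto() or content() throws, the script above exits without closing the browser, leaving a stray Chromium process behind. A common hardening is to wrap the work in try/finally (a sketch, not part of the original example):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com');
    const html = await page.content();
    console.log(html);
  } finally {
    // Runs even if goto() or content() throws
    await browser.close();
  }
})();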

Extracting HTML from Specific Elements

To extract HTML from a specific element on the page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Extract HTML from a specific element
  const elementHtml = await page.evaluate(() => {
    const element = document.querySelector('.main-content');
    return element ? element.outerHTML : null;
  });

  console.log(elementHtml);
  await browser.close();
})();
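
Puppeteer's page.$eval is a terser way to express the same extraction. Note that unlike the page.evaluate version above, it throws if nothing matches the selector, so guard it when the element may be absent:

let elementHtml = null;
try {
  // Throws if '.main-content' is not found
  elementHtml = await page.$eval('.main-content', el => el.outerHTML);
} catch {
  // Leave elementHtml as null when the element is missing
}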

Waiting for Dynamic Content

Modern websites often load content dynamically. To make sure that content has loaded before extraction, wait for the network to settle; networkidle2 resolves once there have been no more than two network connections for at least 500 ms:

await page.goto('https://example.com', { 
  waitUntil: 'networkidle2' 
});

For pages with specific elements that load asynchronously:

await page.waitForSelector('.dynamic-content', { visible: true });
const html = await page.content();
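
By default, waitForSelector gives up after 30 seconds and throws a TimeoutError. If the element may legitimately never appear, set an explicit timeout and handle the failure rather than letting the script crash (a sketch):

try {
  // Allow up to 5 seconds for the element to become visible
  await page.waitForSelector('.dynamic-content', { visible: true, timeout: 5000 });
} catch (err) {
  console.warn('.dynamic-content never appeared; extracting what has loaded');
}
const html = await page.content();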

Extracting Text Content

If you only need the text content without HTML tags:

const textContent = await page.evaluate(() => {
  return document.body.innerText;
});

For a specific element:

const elementText = await page.$eval('.article', el => el.textContent);
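
To collect text from every match rather than just the first, use page.$$eval, the plural counterpart of page.$eval:

// Trimmed text of every paragraph on the page, as an array of strings
const paragraphs = await page.$$eval('p', els =>
  els.map(el => el.textContent.trim())
);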

Extracting Metadata

To extract a webpage's metadata like title, description, and Open Graph data:

const metadata = await page.evaluate(() => {
  return {
    title: document.title,
    description: document.querySelector('meta[name="description"]')?.content || null,
    ogTitle: document.querySelector('meta[property="og:title"]')?.content || null,
    ogDescription: document.querySelector('meta[property="og:description"]')?.content || null,
    ogImage: document.querySelector('meta[property="og:image"]')?.content || null
  };
});

console.log(metadata);
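
If you'd rather not enumerate tags by hand, you can collect every meta tag generically, keyed by its name or property attribute (a sketch; tags with neither attribute are skipped):

const allMeta = await page.evaluate(() => {
  const result = {};
  for (const meta of document.querySelectorAll('meta')) {
    const key = meta.getAttribute('name') || meta.getAttribute('property');
    if (key) result[key] = meta.getAttribute('content');
  }
  return result;
});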

Extracting Links

To extract all links from a webpage:

const links = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('a')).map(a => {
    return {
      text: a.textContent.trim(),
      href: a.href
    };
  });
});

console.log(links);
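
A common follow-up is splitting links into internal and external by hostname, which fits in the same evaluate call (a sketch; non-HTTP hrefs such as mailto: links end up in external):

const groupedLinks = await page.evaluate(() => {
  const internal = [];
  const external = [];
  for (const a of document.querySelectorAll('a[href]')) {
    try {
      // a.href is already resolved to an absolute URL by the browser
      const url = new URL(a.href);
      (url.hostname === location.hostname ? internal : external).push(a.href);
    } catch {
      // Skip hrefs that don't parse as URLs
    }
  }
  return { internal, external };
});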

Handling Authentication

For websites that require authentication:

await page.goto('https://example.com/login');
await page.type('#username', 'your_username');
await page.type('#password', 'your_password');
// Click and wait for the resulting navigation together to avoid a race
await Promise.all([
  page.waitForNavigation(),
  page.click('#login-button'),
]);

// Now that we're logged in, extract the protected content
const html = await page.content();
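
Logging in on every run is slow and can trip rate limits. One option is to persist the session cookies after a successful login and restore them in later runs, using Puppeteer's page.cookies() and page.setCookie() (a sketch; the dashboard URL is hypothetical):

const fs = require('fs');

// After a successful login, save the session cookies
const cookies = await page.cookies();
fs.writeFileSync('cookies.json', JSON.stringify(cookies));

// On a later run, restore them before visiting protected pages
const saved = JSON.parse(fs.readFileSync('cookies.json', 'utf8'));
await page.setCookie(...saved);
await page.goto('https://example.com/dashboard');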

Avoiding Detection

Many websites implement anti-bot measures. The stealth plugin for puppeteer-extra patches many of the signals headless Chrome leaks, reducing the chance of detection:

npm install puppeteer-extra puppeteer-extra-plugin-stealth

Then enable the plugin before launching:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

// Now use puppeteer as usual
const browser = await puppeteer.launch();

Saving Extracted HTML to a File

To save the extracted HTML to a file:

const fs = require('fs');

// Extract HTML
const html = await page.content();

// Write to file
fs.writeFileSync('extracted-page.html', html);

Working with iframes

To extract HTML from an iframe:

// frames()[0] is the main frame, so [1] is the first iframe on the page
const frameContent = await page.frames()[1].content();

// Or find a frame by its name
const namedFrame = page.frames().find(frame => frame.name() === 'frameName');
if (namedFrame) {
  const namedFrameContent = await namedFrame.content();
}

Alternative to Puppeteer: CaptureKit API

Setting up and maintaining Puppeteer for HTML extraction can be challenging. If you need a reliable, scalable solution without the infrastructure overhead, consider the CaptureKit API:

curl "https://api.capturekit.dev/content?url=https://example.com&access_key=YOUR_ACCESS_KEY&include_html=true"
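
The same request from Node, using the global fetch available in Node 18+ (this simply mirrors the curl example above; the response shape matches the documented example below):

(async () => {
  const params = new URLSearchParams({
    url: 'https://example.com',
    access_key: 'YOUR_ACCESS_KEY',
    include_html: 'true',
  });

  const response = await fetch(`https://api.capturekit.dev/content?${params}`);
  const { success, data } = await response.json();
  if (success) console.log(data.html);
})();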

Benefits of CaptureKit API

  • Complete Solution: Extract not just HTML, but also metadata, links, and structured content
  • No Browser Management: No need to maintain browser instances
  • Scale Effortlessly: Handle high-volume extraction without infrastructure concerns

Example Response from CaptureKit API:

{
  "success": true,
  "data": {
    "metadata": {
      "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
      "description": "Tailwind CSS is a utility-first CSS framework.",
      "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
      "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
    },
    "links": {
      "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
      "external": ["https://tailwindui.com", "https://shopify.com"],
      "social": [
        "https://github.com/tailwindlabs/tailwindcss",
        "https://x.com/tailwindcss"
      ]
    },
    "html": "<html><body><h1>Hello, world!</h1></body></html>"
  }
}

Conclusion

Puppeteer offers powerful capabilities for extracting HTML from websites, but it can be complex to set up and maintain. For developers who need a reliable, maintenance-free solution that provides more than just raw HTML, CaptureKit API offers a compelling alternative with comprehensive data extraction capabilities.
