Portadom: A Unified Interface for DOM Manipulation

Introduction

Web scraping, while immensely useful, often requires developers to navigate a sea of tools and libraries, each with its own quirks and intricacies. Whether it's JSDOM, Cheerio, Playwright, or even just plain old vanilla JS in the DevTools console, moving between these platforms can be a challenge.

Enter Portadom, your new best friend in the world of web scraping.

What is Portadom?
Portadom provides a consistent DOM manipulation interface across:

  • Browser API
  • JSDOM
  • Cheerio
  • Playwright

This means you no longer have to rewrite or refactor large chunks of your code when switching between these tools. Instead, you can focus on the logic of your web scraping tasks and let Portadom handle the DOM manipulation intricacies.

The Portadom Workflow

Imagine you're working on a project to scrape data from several websites. You initially start by prototyping in the DevTools console using vanilla JS. Once you've figured out the transformations and data extractions, you realize that some sites can be scraped with static HTML, while others need a JS runtime.

1. Prototyping with Vanilla JS

You start with a simple site and define your transformations directly in the DevTools:

let title = document.querySelector('h1').innerText;
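
As the prototype grows, you might collect a few more fields with plain DOM APIs before porting anything. A minimal sketch (the selectors are illustrative, not tied to any particular site):

// DevTools prototyping - the selectors are illustrative
const pageTitle = document.querySelector('h1')?.innerText ?? null;
const itemTexts = [...document.querySelectorAll('li')].map((li) => li.textContent?.trim() ?? '');
console.log({ pageTitle, itemTexts });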

2. Static HTML with JSDOM or Cheerio

For sites where static HTML is sufficient, you can easily migrate your vanilla JS logic:

import { load as loadCheerio } from 'cheerio';
import { cheerioPortadom } from 'portadom';

const html = `<h1>Welcome to Portadom</h1>`;
const $ = loadCheerio(html);
const dom = cheerioPortadom($.root(), null);

const title = await dom.findOne('h1').text();

With Portadom, the transition feels almost seamless. The core logic remains consistent, and only the setup changes.
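
The JSDOM route looks much the same. The sketch below assumes the JSDOM factory follows the same naming pattern as cheerioPortadom - the name jsdomPortadom and its signature are assumptions here, so check the package exports for the exact API:

import { JSDOM } from 'jsdom';
// NOTE: `jsdomPortadom` and its arguments are assumed by analogy with
// `cheerioPortadom`; consult the portadom exports for the exact name/signature.
import { jsdomPortadom } from 'portadom';

const html = `<h1>Welcome to Portadom</h1>`;
const { window } = new JSDOM(html);
const dom = jsdomPortadom(window.document.body, null);

const title = await dom.findOne('h1').text();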

3. Dynamic Sites with Playwright

For websites that rely heavily on JavaScript, you'd need a tool like Playwright. But with Portadom, even this transition is smooth:

import { playwrightLocatorPortadom } from 'portadom';

// `browser` is a Playwright Browser instance created beforehand
const page = await browser.newPage();
await page.goto('https://example.com');

const bodyLoc = page.locator('body');
const dom = playwrightLocatorPortadom(bodyLoc, page);

const title = await dom.findOne('h1').text();

Notice how, once again, only the setup changed. The actual DOM querying logic remains consistent, thanks to Portadom.

Embracing Flexibility

Portadom is all about flexibility. No matter where you start β€” be it with Cheerio for static HTML parsing or Playwright for dynamic sites β€” you're never locked in. If your needs change, Portadom makes it easy to switch your underlying platform without overhauling your entire scraping logic.
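
In practice, that can mean keeping the extraction logic in a single function that accepts any Portadom instance, with each platform contributing only a few lines of setup. A rough sketch (the extractTitle helper and the Playwright launch flow are illustrative, not part of the library):

import { load as loadCheerio } from 'cheerio';
import { chromium } from 'playwright';
import { cheerioPortadom, playwrightLocatorPortadom } from 'portadom';

// Extraction logic written once against the Portadom interface.
// (`dom` is typed loosely here; the library's own types can be used instead.)
const extractTitle = (dom: any) => dom.findOne('h1').text();

// Static HTML via Cheerio - only the setup differs.
const titleFromHtml = async (html: string) => {
  const $ = loadCheerio(html);
  return extractTitle(cheerioPortadom($.root(), null));
};

// Dynamic site via Playwright - same extraction logic, different setup.
const titleFromPage = async (url: string) => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const title = await extractTitle(playwrightLocatorPortadom(page.locator('body'), page));
  await browser.close();
  return title;
};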

Take the Leap with Portadom

Web scraping is finicky - everything breaks all the time. With Portadom, you're equipped with a tool that lets you focus on crafting the perfect data extraction strategy without getting bogged down by the intricacies of various DOM manipulation libraries. Dive in and let Portadom streamline your web scraping journey!

Portadom has already been used successfully in real scraping projects (see the examples below).

Portadom currently supports the following manipulations (a short combined sketch follows the list):

  • Element attributes, properties, text
  • findOne - equivalent to document.querySelector
  • findMany - equivalent to document.querySelectorAll
  • closest - equivalent to Element.closest
  • parent - equivalent to Element.parentElement
  • children - equivalent to Element.children
  • root - Get document root
  • remove - Remove current Element
  • getCommonAncestor - Get common ancestor between this and other Element
  • getCommonAncestorFromSelector - Get common ancestor between this and other Element (found by selector)
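
A minimal sketch combining a few of these, assuming a Cheerio backend and illustrative HTML:

import { load as loadCheerio } from 'cheerio';
import { cheerioPortadom } from 'portadom';

// Illustrative HTML for the sketch
const html = `
  <article data-id="42">
    <h2><a href="/offers/42">Senior TS Developer</a></h2>
    <span class="badge">Remote</span>
  </article>`;

const $ = loadCheerio(html);
const dom = cheerioPortadom($.root(), null);

const linkEl = dom.findOne('h2 a');                               // findOne
const offerName = await linkEl.text();                            // element text
const offerUrl = await linkEl.attr('href');                       // element attribute
const offerId = await linkEl.closest('article').attr('data-id');  // closest
const badges = await dom.findMany('.badge')                       // findMany
  .mapAsyncSerial((el) => el.text());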

Chaining

For cross-compatibility, each method on a Portadom instance returns a Promise.

But this quickly leads to then/await hell when you need to call multiple methods in a row:

const employerName = (await (await el.findOne('.employer'))?.text()) ?? null;

To get around that, the results are wrapped in a chainable instance. This applies to each method that returns a Portadom instance or an array of Portadom instances.

So instead, we can call:

const employerName = await el.findOne('.employer').text();

You don't have to chain the commands, though. You can also access the underlying promise via the promise property. For example, this:

const mapPromises = await dom.findOne('ul')
  .parent()
  .findMany('li[data-id]')
  .map((li) => li.attr('data-id'));
const attrs = await Promise.all(mapPromises);

Is the same as:

const ul = await dom.findOne('ul').promise;
const parent = await ul?.parent().promise;
const idEls = await parent?.findMany('li[data-id]').promise;
const mapPromises = idEls?.map((li) => li.attr('data-id')) ?? [];
const attrs = await Promise.all(mapPromises);

Examples

Example - Profesia.sk

See source code here.

// Following lines added for completeness
const $ = loadCheerio(html);
// `pageUrl` (illustrative name) is the URL the HTML was fetched from
const dom = cheerioPortadom($.root(), pageUrl);
// ...
const rootEl = dom.root();
const url = await dom.url();

// Find and extract data
const entries = await rootEl.findMany('.list-row:not(.native-agent):not(.reach-list)')
  .mapAsyncSerial(async (el) => {
  const employerName = await el.findOne('.employer').text();
  const employerUrl = await el.findOne('.offer-company-logo-link').href();
  const employerLogoUrl = await el.findOne('.offer-company-logo-link img').src();

  const offerUrlEl = el.findOne('h2 a');
  const offerUrl = await offerUrlEl.href();
  const offerName = await offerUrlEl.text();
  const offerId = offerUrl?.match(/O\d{2,}/)?.[0] ?? null;

  const location = await el.findOne('.job-location').text();

  const salaryText = await el.findOne('.label-group > a[data-dimension7="Salary label"]').text();

  const labels = await el.findMany('.label-group > a:not([data-dimension7="Salary label"])')
    .mapAsyncSerial((el) => el.text())
    .then((arr) => arr.filter(Boolean) as string[]);

  const footerInfoEl = el.findOne('.list-footer .info');
  const lastChangeRelativeTimeEl = footerInfoEl.findOne('strong');
  const lastChangeRelativeTime = await lastChangeRelativeTimeEl.text();
  // Remove the element so it's easier to get the text content
  await lastChangeRelativeTimeEl.remove();
  const lastChangeTypeText = await footerInfoEl.textAsLower();
  const lastChangeType = lastChangeTypeText === 'pridanΓ©' ? 'added' : 'modified';

  return {
    listingUrl: url,
    employerName,
    employerUrl,
    employerLogoUrl,
    offerName,
    offerUrl,
    offerId,
    location,
    labels,
    lastChangeRelativeTime,
    lastChangeType,
  };
});

Example - SKCRIS

See source code here.

// Following lines added for completeness. Edited for brevity.
const $ = loadCheerio(html);
// `pageUrl` (illustrative name) is the URL the HTML was fetched from
const dom = cheerioPortadom($.root(), pageUrl);
// ...
const url = await dom.url();
const rootEl = dom.root();
const tableDataEls = await rootEl
  .findMany('.detail > tr')
  .filterAsyncSerial((el) => el.text()) // Remove empty tags
  .slice(1, -1).promise; // Remove first row (heading) and last row (related resources)

const tableData = tableDataEls.reduce(async (promiseAgg, rowEl) => {
  const agg = await promiseAgg;
  const [title, val] = await rowEl.children()
    .mapAsyncSerial(async (el) => {
      const text = await el.text();
      return text?.replace(/\s+/g, ' ') ?? null;
    });
  if (!title) return agg;

  agg[title] = val ?? null;
  return agg;
}, Promise.resolve({} as Record<string, string | null>));

return tableData;

Example - Facebook post timestamp

Facebook prompted the need to add the getCommonAncestor method, as Facebook's HTML doesn't provide many reliable patterns to work with.

Note how we used getCommonAncestor to get an element that wasn't easily targetable by class/attribute selectors.

// Following lines added for completeness. Edited for brevity.
const body = await page.evaluateHandle(() => document.body);
const dom = playwrightHandlePortadom(body, page);
// ...
// Find container with post stats
const likesEl = await dom.findOne('[aria-label*="Like:"]').promise;
const commentsEl = await dom
  .findMany('[role="button"] [dir="auto"]')
  .findAsyncSerial(async (el) => {
    const text = await el.text();
    return text?.match(URL_REGEX.COMMENT_COUNT);
  }).promise;

const statsContainerEl =
  likesEl?.node && commentsEl?.node
    ? await likesEl.getCommonAncestor(commentsEl.node).promise
    : null;
// "6.9K views"
const viewsText = await statsContainerEl
  ?.children()
  .findAsyncSerial(async (domEl) => {
    const text = await domEl.text();
    return text?.match(/views/i);
  })
  .textAsLower();

Learn more

Top comments (5)

Tommy Chan

There are multiple ways to reduce the lodash size.

  1. Most users should be using ES Modules these days. Switch to lodash-es and let the bundler do tree shaking.
  2. Use babel-plugin-lodash.
  3. Bundle/minify this library with a properly set up bundler.

Also, this is a web scraping library. The source code is already there, so there is no network traffic time. Isn't this micro-optimization?

Juro Oravec

Great suggestions, thanks! I was a bit lazy on my side.

Juro Oravec

Updated, it doesn't have any dependencies now :)

LeMoussel

Examples of how you scrape Facebook, Amazon Products, Profesia.sk & SKCRIS would be welcome

Juro Oravec

Added the ones I could - Facebook, Profesia.sk, SKCRIS :)