DEV Community

Željko Šević

Originally published at sevic.dev


Web scraping with jsdom

Web scraping means extracting data from websites. This post covers extracting data from a page's HTML when the data is stored in a JavaScript variable or as stringified JSON.

Scraping first requires retrieving the HTML page with an HTTP client.
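As a minimal sketch of that retrieval step, assuming Node.js 18 or later (where `fetch` is available globally): the helper name `fetchHtml` and its `fetchImpl` parameter are hypothetical and only for illustration. It rejects on non-2xx responses so that an error page is not scraped by mistake.

```javascript
// Hypothetical helper: download a page and return its HTML as a string.
// `fetchImpl` defaults to the global fetch (Node.js 18+); it is a parameter
// only so the helper can be exercised without network access.
async function fetchHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) {
    // Fail early instead of passing an error page on to the scraper.
    throw new Error(`Request to ${url} failed with status ${res.status}`);
  }
  return res.text();
}

// Usage (placeholder address):
// fetchHtml('https://example.com').then((html) => console.log(html.length));
```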

Examples

The example below rewrites the page source so that the data is also assigned to a global variable, executes the page scripts, and reads the data from that global variable.

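import jsdom from 'jsdom';

// Placeholder: the address of the page to scrape.
const URL = 'https://example.com';

fetch(URL)
  .then((res) => res.text())
  .then((response) => {
    // Expression in the page's script that holds the data.
    const dataVariable = 'someVariable.someField';
    // Prefix its first occurrence with `var data=` so the value is also assigned to a global.
    const html = response.replace(dataVariable, `var data=${dataVariable}`);

    // Execute the page's scripts; a fresh virtual console keeps their output out of the Node.js console.
    const dom = new jsdom.JSDOM(html, {
      runScripts: 'dangerously',
      virtualConsole: new jsdom.VirtualConsole(),
    });

    console.log('data', dom?.window?.data);
  });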
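To make the string rewrite in the example above concrete, here is a small sketch of what `replace` does to a made-up page. The `sampleHtml` content is hypothetical, and the sketch assumes the first occurrence of the expression in the page is the assignment itself; if the expression appears earlier in the source, the rewrite lands there instead.

```javascript
// Hypothetical page source; real pages will differ.
const sampleHtml = `
<script>
  var someVariable = {};
  someVariable.someField = { items: [1, 2, 3] };
</script>`;

const dataVariable = 'someVariable.someField';

// With a string pattern, String.prototype.replace rewrites only the first match.
const rewritten = sampleHtml.replace(dataVariable, `var data=${dataVariable}`);

// The assignment line now reads:
//   var data=someVariable.someField = { items: [1, 2, 3] };
// which is a chained assignment, so `data` receives the same object.
console.log(rewritten);
```

Because the rewritten statement uses `var` at the top level of a classic script, running it defines `data` as a property of the window, which is why it can then be read from `dom.window.data`.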

The example below runs the page scripts and reads stringified JSON data from an element's value.

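import jsdom from 'jsdom';

// Placeholder: the address of the page to scrape.
const URL = 'https://example.com';

fetch(URL)
  .then((res) => res.text())
  .then((response) => {
    // Execute the page's scripts so that any script-populated element is filled in.
    const dom = new jsdom.JSDOM(response, {
      runScripts: 'dangerously',
      virtualConsole: new jsdom.VirtualConsole(),
    });

    // `.value` is available on form elements such as <input> or <textarea>.
    const data = dom?.window?.document?.getElementById('someId')?.value;

    console.log('data', JSON.parse(data));
  });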
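In the example above, `data` is `undefined` when the element is missing, and `JSON.parse(undefined)` throws a `SyntaxError`. A minimal sketch of a more defensive parse follows; the helper name `parseJsonOrFallback` is hypothetical, not part of jsdom.

```javascript
// Hypothetical helper: parse a JSON string taken from the page, returning
// `fallback` when the value is missing or is not valid JSON.
function parseJsonOrFallback(value, fallback = null) {
  if (typeof value !== 'string') {
    return fallback;
  }
  try {
    return JSON.parse(value);
  } catch {
    return fallback;
  }
}

// Usage: a valid string is parsed, anything else yields the fallback.
console.log(parseJsonOrFallback('{"id": 1}'));
console.log(parseJsonOrFallback(undefined));
```

Returning a fallback keeps the scraper running when a page changes its markup; throwing instead may be preferable when missing data should stop the run.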

Disclaimer

Please check the website's terms of service before scraping it. Some websites may have terms of service that prohibit such activity.
