Christian

Posted on • Originally published at cri.dev

Ultimate web scraping with browserless, puppeteer and Node.js

Originally posted on cri.dev

Browser automation built for enterprises, loved by developers.

browserless.io is a neat service for hosted puppeteer scraping, but there is also the official Docker image for running it locally.

I was amazed when I found out about it 🤯!

Find the whole source code on GitHub: christian-fei/browserless-example!

Running browserless in docker

A single docker command is enough to get a full puppeteer backend, with configurable concurrency, queue length and timeouts, that you can drive from your own puppeteer scripts.

You can connect to a browserless backend by passing the option browserWSEndpoint like this:

async function createBrowser () {
  return puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' })
}
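If the backend does not always live on localhost, the endpoint can come from the environment instead of being hard-coded. Here is a minimal sketch of that idea. Note that BROWSERLESS_WS_ENDPOINT and resolveEndpoint are names I made up for this example, neither puppeteer nor browserless reads them on its own:

```javascript
// Hypothetical helper: BROWSERLESS_WS_ENDPOINT is a name chosen for this sketch,
// not a variable that puppeteer or browserless knows about.
const DEFAULT_ENDPOINT = 'ws://localhost:3000'

function resolveEndpoint (env = process.env) {
  const raw = env.BROWSERLESS_WS_ENDPOINT || DEFAULT_ENDPOINT
  const url = new URL(raw) // throws a TypeError on malformed input
  if (url.protocol !== 'ws:' && url.protocol !== 'wss:') {
    throw new Error(`expected a ws:// or wss:// endpoint, got ${raw}`)
  }
  return raw
}

// Possible usage inside createBrowser:
// puppeteer.connect({ browserWSEndpoint: resolveEndpoint() })
```

Failing early on a wrong protocol gives a clearer error than a connection attempt that times out.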

To start the backend you can use the following command, using the docker image browserless/chrome:

docker run \
  -e "MAX_CONCURRENT_SESSIONS=15" \
  -e "MAX_QUEUE_LENGTH=0" \
  -e "PREBOOT_CHROME=true" \
  -e "DEFAULT_BLOCK_ADS=true" \
  -e "DEFAULT_IGNORE_HTTPS_ERRORS=true" \
  -e "CONNECTION_TIMEOUT=600000" \
  -p 3000:3000 \
  --rm -it browserless/chrome
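The command above allows 15 concurrent sessions and sets the queue length to 0. As far as I understand these settings, connections beyond the limit are turned away rather than queued, so it makes sense to cap concurrency on the client side as well. The following is a rough sketch of such a cap in plain Node.js. runWithLimit is a hypothetical helper written for this post, not part of puppeteer or browserless:

```javascript
// Run async tasks with at most `limit` of them in flight at any time.
// `tasks` is an array of functions that each return a promise.
// If any task rejects, the returned promise rejects with that error.
async function runWithLimit (tasks, limit) {
  const results = new Array(tasks.length)
  let next = 0

  async function worker () {
    while (next < tasks.length) {
      const index = next++ // claim the next task before awaiting anything
      results[index] = await tasks[index]()
    }
  }

  // start up to `limit` workers that pull tasks until none are left
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker)
  await Promise.all(workers)
  return results
}

// Possible usage, where scrapeTitle(url) is a hypothetical function that opens
// a page on the browserless backend and resolves with its title:
// const titles = await runWithLimit(urls.map(url => () => scrapeTitle(url)), 15)
```

Results come back in the same order as the tasks, regardless of which one finishes first.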

Source code

The christian-fei/browserless-example repository on GitHub contains a web crawler built with puppeteer. To try it:

git clone https://github.com/christian-fei/browserless-example.git
cd browserless-example
npm i

npm run start-browserless
node crawl-with-api.js https://christianfei.com

Puppeteer using browserless docker backend

You simply pass the browser WebSocket endpoint ws://localhost:3000 to puppeteer.connect and you're connected to the browserless backend!

Here is a short example of getting all links <a> on christianfei.com:

const puppeteer = require('puppeteer')

main(process.argv[2])
  .then(() => {
    console.log('finished, exiting')
    process.exit(0)
  })
  .catch(err => {
    console.error(err)
    process.exit(1)
  })

async function main (url = 'https://christianfei.com') {
  const browser = await createBrowser()
  try {
    const page = await browser.newPage()
    await page.goto(url)
    console.log('title', await page.title())
    // DOM elements cannot be serialized back to Node.js, so return the href strings
    const links = await page.evaluate(selector => Array.from(document.querySelectorAll(selector), a => a.href), 'a')
    console.log('links.length', links.length)
  } finally {
    // close the remote browser session when done
    await browser.close()
  }
}

// connect to the browserless backend started with docker
async function createBrowser () {
  return puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' })
}
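For a crawler, the raw list of links usually needs some cleanup before following them. Below is a small sketch of that step, assuming you collect the href strings of the links inside page.evaluate (for example with a => a.href). internalLinks is a hypothetical helper for illustration, I'm not claiming the crawler in the repository works exactly like this:

```javascript
// Turn raw href strings into a deduplicated list of same-origin page URLs.
function internalLinks (hrefs, baseUrl) {
  const origin = new URL(baseUrl).origin
  const seen = new Set()
  for (const href of hrefs) {
    let url
    try {
      url = new URL(href, baseUrl) // resolves relative hrefs against the page URL
    } catch (err) {
      continue // skip hrefs that cannot be parsed
    }
    if (url.origin !== origin) continue // external sites, mailto: links, etc.
    url.hash = '' // #fragments point at the same document
    seen.add(url.href)
  }
  return [...seen]
}

// Possible usage after page.evaluate returned an array of href strings:
// const toVisit = internalLinks(links, 'https://christianfei.com')
```

Dropping the fragment and deduplicating with a Set keeps the crawler from visiting the same page several times.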
