Ultimate web scraping with browserless, puppeteer and Node.js

Christian (christianfei) ・ Originally published at cri.dev

Browser automation built for enterprises, loved by developers.

browserless.io is a neat service for hosted puppeteer scraping, but there is also the official Docker image for running it locally.

I was amazed when I found out about it 🤯!

Find the whole source code on GitHub: christian-fei/browserless-example

Running browserless in docker

A single docker command is enough to get a full puppeteer backend, with configurable concurrency and more, that your puppeteer scripts can connect to.

You can connect to a browserless backend by passing the option browserWSEndpoint like this:

async function createBrowser () {
  return puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' })
}
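
If the script starts before the backend is ready to accept connections, the first connection attempt can fail. A minimal sketch of a retry wrapper follows; `connectWithRetry`, `attempts` and `delayMs` are hypothetical names for illustration and are not part of puppeteer or browserless:

```javascript
// Retry a connection attempt a few times before giving up.
// `connect` is any function that returns a promise, for example:
//   () => puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' })
async function connectWithRetry (connect, { attempts = 5, delayMs = 500 } = {}) {
  let lastError
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await connect()
    } catch (err) {
      lastError = err
      // wait before the next attempt, but not after the last one
      if (attempt < attempts) {
        await new Promise(resolve => setTimeout(resolve, delayMs))
      }
    }
  }
  // every attempt failed: surface the most recent error
  throw lastError
}
```

With this in place, createBrowser could return `connectWithRetry(() => puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' }))` instead of connecting directly.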

To start the backend you can use the following command, using the docker image browserless/chrome:

docker run \
  -e "MAX_CONCURRENT_SESSIONS=15" \
  -e "MAX_QUEUE_LENGTH=0" \
  -e "PREBOOT_CHROME=true" \
  -e "DEFAULT_BLOCK_ADS=true" \
  -e "DEFAULT_IGNORE_HTTPS_ERRORS=true" \
  -e "CONNECTION_TIMEOUT=600000" \
  -p 3000:3000 \
  --rm -it browserless/chrome
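
The command above sets MAX_CONCURRENT_SESSIONS to 15 and MAX_QUEUE_LENGTH to 0. My reading of these settings (an assumption, not something stated in this post) is that sessions beyond the limit are turned away rather than queued, so it can help to cap concurrency on the client side as well. A rough sketch, where `runLimited` is a hypothetical helper:

```javascript
// Run async tasks with at most `limit` of them in flight at once.
// `tasks` is an array of functions that each return a promise.
// Results are returned in the same order as the tasks.
async function runLimited (tasks, limit = 15) {
  const results = new Array(tasks.length)
  let next = 0

  // each worker keeps picking up the next pending task until none are left
  async function worker () {
    while (next < tasks.length) {
      const index = next++
      results[index] = await tasks[index]()
    }
  }

  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker)
  await Promise.all(workers)
  return results
}
```

A crawler could then call something like `runLimited(urls.map(url => () => crawl(url)), 15)`, where `crawl` stands in for whatever function opens a page and scrapes it. If any task rejects, the returned promise rejects with that error.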

Source code

The repository contains a small web crawler built with puppeteer. Clone it, start the browserless backend and run the crawler:

git clone https://github.com/christian-fei/browserless-example.git
cd browserless-example
npm i

npm run start-browserless
node crawl-with-api.js https://christianfei.com

Puppeteer using browserless docker backend

Point puppeteer at the browser WebSocket endpoint ws://localhost:3000 and it will drive the browserless backend instead of launching its own Chrome.

Here is a short example that collects all links (<a> elements) on christianfei.com:

const puppeteer = require('puppeteer')

main(process.argv[2])
  .then(() => {
    // console.log returns undefined, so chaining it with && would never reach process.exit
    console.log('finished, exiting')
    process.exit(0)
  })
  .catch(err => {
    console.error(err)
    process.exit(1)
  })

async function main (url = 'https://christianfei.com') {
  // connect to the browserless backend started with docker
  const browser = await createBrowser()
  const page = await browser.newPage()
  await page.goto(url)
  console.log('title', await page.title())
  // DOM elements cannot be passed back to Node.js as-is, so return their href strings
  const links = await page.evaluate(
    selector => [...document.querySelectorAll(selector)].map(a => a.href),
    'a'
  )
  console.log('links.length', links.length)
}

async function createBrowser () {
  return puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' })
}
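
A crawler usually wants to follow only links on the same site and visit each page once. Here is a small sketch of that filtering step using Node's built-in URL class. `sameOriginLinks` is a hypothetical helper and not necessarily how the crawler in the repository does it:

```javascript
// Resolve raw href values against the page URL, drop #fragments,
// keep only links on the same origin as the page, and remove duplicates.
function sameOriginLinks (hrefs, pageUrl) {
  const origin = new URL(pageUrl).origin
  const seen = new Set()
  for (const href of hrefs) {
    let url
    try {
      url = new URL(href, pageUrl)
    } catch (err) {
      continue // skip values that cannot be parsed as a URL
    }
    url.hash = ''
    if (url.origin === origin) seen.add(url.href)
  }
  return [...seen]
}
```

Given an array of href strings collected from a page, `sameOriginLinks(hrefs, 'https://christianfei.com')` returns the list of pages on that site that are left to crawl.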
