Web Scraping Fundamentals with Puppeteer and Node

Hey Techies! Today I'm excited to dive into a fascinating topic in the web development community: web scraping.
Specifically, we'll explore how you can use the dynamic duo of Puppeteer and Node.js to collect data from websites like a pro.

What is Web Scraping?

Let's talk about what web scraping actually is. Basically, it's the process of extracting information from websites and storing it for further analysis or use. Whether you're building a price comparison tool, collecting market research data, or just satisfying your curiosity, web scraping can be a powerful tool in your developer toolbox.

Introducing Puppeteer

So why Puppeteer and Node.js? Puppeteer is a Node library that provides a high-level API for controlling headless Chrome or Chromium over the DevTools Protocol. Simply put, it lets you automate interactions with web pages, such as clicking buttons, filling out forms, and yes, scraping data. And with the flexibility and versatility of Node.js, the possibilities are endless.
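To give you a taste before the step-by-step, here's a minimal sketch of that kind of automation (the "#name" and "#submit" selectors are hypothetical placeholders, not real elements on example.com):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // hypothetical selectors - swap in real ones from your target page
  await page.type('#name', 'Jane Doe'); // fill out a form field
  await page.click('#submit');          // click a button
  await browser.close();
})();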
Now let's get to work. Here are step-by-step instructions to help you start web scraping with Puppeteer and Node.js:

Environment Setup:

First, make sure Node.js is installed on your computer. You can download it from the official Node.js website if you haven't already. Once Node.js is set up, initialize a new Node.js project:

npm init -y

You can then install Puppeteer via npm (this also downloads a compatible build of Chromium, so the install may take a moment):

npm install puppeteer

Scripting:

Now that your environment is ready, it's time to start coding! Create a new JavaScript file (let's call it "index.js")

touch index.js


and import Puppeteer at the top of the file using

const puppeteer = require('puppeteer');

Starting the Browser:

Next, you need to launch a browser instance with Puppeteer. The launch itself is a single line of code:

async function scrape() {
  const browser = await puppeteer.launch();
}


This will open a new instance of Chrome in headless mode (i.e., with no visible browser window).
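A tip while you're developing: launch() accepts options, and turning headless mode off lets you watch the browser work. Here's a small sketch (the slowMo value is an arbitrary choice for demonstration):

// inside your async scrape() function, instead of the plain launch():
const browser = await puppeteer.launch({
  headless: false, // show the browser window while you debug
  slowMo: 100,     // slow each operation by 100ms so you can follow along
});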

Navigating to a Web Page:

Once you have a browser instance, you can navigate to any web page using Puppeteer's "newPage()" and "goto()" methods. Example:

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded',
  });
}

Data Scraping:

Now comes the fun part - collecting the data you need from the site. This may involve selecting elements, extracting text or attributes, and saving the data to a file or database. Puppeteer provides several methods for interacting with the page, such as "evaluate()", which runs a function inside the page's context and makes scraping easy.

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded',
  });
  // the callback runs inside the page, so we can use the DOM directly -
  // for example, to grab the page title
  const data = await page.evaluate(() => {
    const title = document.title;
    return title;
  });
}
You can also extract several elements at once. For example, to collect the text and author of every post under a ".posts" class:

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded',
  });
  const data = await page.evaluate(() => {
    // select all elements with the "posts" class
    const posts = document.querySelectorAll('.posts');
    // pull the text and author paragraphs out of each post
    return Array.from(posts).map((post) => {
      const text = post.querySelector('p.text').innerText;
      const author = post.querySelector('p.author').innerText;
      return { text, author };
    });
  });
}
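Since the scraped data usually needs to end up in a file or database, here's a minimal sketch that writes the posts to disk with Node's built-in fs module (the "posts.json" filename is arbitrary):

const fs = require('fs');

// inside scrape(), once page.evaluate() has resolved:
fs.writeFileSync('posts.json', JSON.stringify(data, null, 2));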

Closing the Browser:

When you are done collecting data, don't forget to close your browser to free up system resources. You can do this with the "browser.close()" method.

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {
    waitUntil: 'domcontentloaded',
  });
  const data = await page.evaluate(() => {
    const posts = document.querySelectorAll('.posts');
    return Array.from(posts).map((post) => {
      const text = post.querySelector('p.text').innerText;
      const author = post.querySelector('p.author').innerText;
      return { text, author };
    });
  });
  await browser.close();
  // return the data so the caller below can log it
  return data;
}

scrape().then((res) => {
  console.log(res);
}).catch((error) => {
  console.log(error);
});
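One caveat worth noting: in the script above, if goto() or evaluate() throws, execution never reaches browser.close() and a Chromium process is left running. Wrapping the work in try/finally guards against that; here's a sketch of that pattern:

async function scrape() {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com', {
      waitUntil: 'domcontentloaded',
    });
    return await page.evaluate(() => document.title);
  } finally {
    // runs whether scraping succeeded or threw, so the browser always closes
    await browser.close();
  }
}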

And there you have it - a basic overview of web scraping with Puppeteer and Node.js! Of course, you can do a lot more with Puppeteer, from taking screenshots and creating PDF files to testing and debugging web applications. But hopefully this guide has given you a solid foundation for exploring the exciting world of web scraping. Good luck scraping! 🕵️‍♂️✨
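Before you go, one parting snippet: the screenshots and PDFs mentioned above are each a single method call (note that page.pdf() only works in headless mode):

// before browser.close(), inside scrape():
await page.screenshot({ path: 'example.png' }); // capture the page as an image
await page.pdf({ path: 'example.pdf' });        // render the page as a PDF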

Top comments (5)

Marco

I would add that I recommend dockerizing the browser in order to optimize server-side resources

Omojola Tomiloba David

That's something I did not know. I am quite new to containers and Docker. I will implement this and see how it goes.
Thanks for the info

Kos-M

Nice post!
I've written a lib that distributes tasks to workers, and one of its use cases is distributed scraping.
The master node distributes URLs to fetch, and all connected workers start scraping and return results back to the master.

Check it out if you're interested in this: github.com/queue-xec/master/tree/d...

❤️

Omojola Tomiloba David

Thanks a lot
And really nice work you have there

Brian

This is great, except use Playwright