Web scraping extracts unstructured data from a website and converts it into a structured, more readable format such as JSON or CSV. Many organizations publish rules that specify which endpoints may be scraped.
When you scrape a website for personal use, manually adjusting your code every time a site blocks you is tedious, because most big-brand websites want people to refrain from scraping their public data. Common obstacles include CAPTCHAs, user-agent blocking (rules for allowed and disallowed endpoints), IP blocking, and the need to set up a proxy network.
A practical use case of web scraping is notifying users of price changes for an item on sites like Amazon, eBay, etc.
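As a rough illustration of that use case, the sketch below compares a freshly scraped price with the last stored one. The helper names `parsePrice` and `detectPriceChange` and the sample prices are hypothetical and are not part of any Bright Data or Puppeteer API.

```javascript
// Hypothetical helper: turn a scraped price string such as "$1,299.99" into a number.
function parsePrice(text) {
  const cleaned = text.replace(/[^0-9.]/g, "");
  const value = Number.parseFloat(cleaned);
  if (Number.isNaN(value)) {
    throw new Error(`Could not parse a price from "${text}"`);
  }
  return value;
}

// Hypothetical helper: describe how the price moved between two scrapes.
function detectPriceChange(previousPrice, currentPrice) {
  // Round to cents to avoid floating-point noise
  const difference = Math.round((currentPrice - previousPrice) * 100) / 100;
  if (difference === 0) {
    return { changed: false, direction: "none", difference: 0 };
  }
  return {
    changed: true,
    direction: difference < 0 ? "down" : "up",
    difference,
  };
}

// Made-up sample values standing in for two scrapes of the same product page
const change = detectPriceChange(parsePrice("$1,299.99"), parsePrice("$1,199.99"));
console.log(change);
```

A notification step, such as sending an email, could run whenever `changed` is true.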
In this article, you will learn how to use Bright Data’s Scraping Browser, whose built-in unlocking capabilities let you scrape websites at scale without being blocked.
Sandbox
Test and run the complete code in this Codesandbox.
Prerequisites
You need the following to complete this tutorial:
- Basic knowledge of JavaScript
- Node.js installed on your local machine, which is required to install the dependencies
- A code editor, such as VS Code
What is Bright Data?
Bright Data is a data collection and aggregation service with a massive network of IP addresses and proxies for scraping information from websites. That network gives it the resources to avoid detection by the bots companies use to prevent data scraping.
In essence, Bright Data does the heavy lifting in the background, backed by the large datasets available on its platform, so you no longer have to worry about being blocked while accessing website data.
What is a headless browser?
A headless browser is a browser that operates without a graphical user interface (GUI). Modern web browsers such as Google Chrome, Safari, Brave, and Mozilla Firefox all have a graphical interface for interactivity and for displaying visual content. A headless browser instead runs in the background, driven by scripts that developers write or by commands from the command-line interface (CLI).
Using a headless browser for web scraping is useful because it lets you extract data from public websites by simulating user behavior.
Headless browsers are suitable for the following:
- Automated testing
- Web scraping
Benefits of Puppeteer
Puppeteer is a Node.js library that controls a headless browser. The following are some of the benefits of using Puppeteer in web scraping:
- Crawling single-page applications (SPAs)
- Automating tests of website code
- Clicking on page elements
- Downloading data
- Generating screenshots and PDFs of pages
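To make a couple of these benefits concrete, here is a minimal sketch that clicks an element and reads the text that appears afterward. It assumes `page` is a Puppeteer Page object, like the one created with `browser.newPage()` later in this article. The function name `clickAndRead` and the selectors passed to it are hypothetical.

```javascript
// Sketch: open a URL, click a button, then read the text of a result element.
// `page` is expected to be a Puppeteer Page; the selectors are supplied by the caller.
async function clickAndRead(page, url, buttonSelector, resultSelector) {
  // Wait only until the DOM is parsed, not for every image or stylesheet
  await page.goto(url, { waitUntil: "domcontentloaded" });
  await page.click(buttonSelector);
  // In an SPA the result may be rendered after the click, so wait for it
  await page.waitForSelector(resultSelector);
  // The callback runs inside the browser and returns the trimmed text
  return page.$eval(resultSelector, (element) => element.textContent.trim());
}
```

Passing `page` in as an argument keeps the function independent of how the browser was launched or connected.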
Installation
Create a new folder for this app, and run the command below to initialize a Node.js project.
npm init -y
The command will initialize this project and create a package.json file containing all the dependencies and project information. The -y flag accepts all the defaults upon initialization of the app.
With the initialization complete, let’s install the nodemon dependency with this command:
npm install -D nodemon
Nodemon is a tool that automatically restarts the Node application when a file changes.
In package.json, update the scripts object with this code:
package.json
{
  ...
  "scripts": {
    "start": "node index.js",
    "start:dev": "nodemon index.js"
  },
  ...
}
Next, create a file, index.js, in the directory's root, which will be the entry point for writing the script.
The other package to install is puppeteer-core, a version of the automation library that does not download a browser and is meant for connecting to a remote browser.
npm install puppeteer-core
Building with Bright Data’s Scraping Browser
Create an account on Bright Data to access all its services. For this project, the focus is on the Scraping Browser functionality.
On your admin dashboard, click Proxies and Scraping Infra.
Scroll to the bottom of the page and find the Scraping Browser among the proxy products listed, then click its Get started button.
On opening the tool, give the proxy a name and click the Add Proxy button. When prompted about creating a new zone, select Yes.
The next screen displays the host, username, and password.
Now, click the </> Check out code and integration examples button, and on the next screen, select Node.js as the language of choice for this app.
Creating environment variables
Environment variables hold secret keys and credentials that should not be shared, hosted, or pushed to GitHub, to prevent unauthorized access.
Before creating the .env file in the root of the directory, install the dotenv package with this command:
npm install dotenv
Copy and paste this code into the .env file, and replace the entire value inside the quotation marks with the values from your Access parameters tab:
.env
UNAME="<user-name>"
HOST="<host>"
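Because a missing value in .env only surfaces later as a failed connection, it can help to check the variables up front. The sketch below is one possible way to do that. It assumes UNAME holds the username and password pair and HOST holds the host and port, as copied from the Access parameters tab. The function name `buildEndpoint` and the sample values are hypothetical.

```javascript
// Sketch: validate the variables from .env and build the WebSocket endpoint.
// Assumption: UNAME is "username:password" and HOST is "host:port".
function buildEndpoint(env = process.env) {
  const missing = ["UNAME", "HOST"].filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return `wss://${env.UNAME}@${env.HOST}`;
}

// Placeholder values for illustration only, not real credentials
const endpoint = buildEndpoint({ UNAME: "user:pass", HOST: "example.test:9222" });
console.log(endpoint); // wss://user:pass@example.test:9222
```

In a real script, call `require("dotenv").config()` first so that `process.env` is populated, then call `buildEndpoint()` with no arguments.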
Creating a web scraper using Puppeteer
Back in the entry point file, index.js, copy and paste this code:
index.js
const puppeteer = require("puppeteer-core");
require("dotenv").config();

// Credentials and host come from the .env file
const auth = process.env.UNAME;
const host = process.env.HOST;

async function run() {
  let browser;
  try {
    // Connect to the remote Scraping Browser over WebSocket
    browser = await puppeteer.connect({
      browserWSEndpoint: `wss://${auth}@${host}`,
    });
    const page = await browser.newPage();
    // Allow up to 2 minutes for navigation
    page.setDefaultNavigationTimeout(2 * 60 * 1000);
    await page.goto("http://lumtest.com/myip.json");
    // Print the full HTML of the loaded page
    const html = await page.content();
    console.log(html);
  } catch (e) {
    console.error("run failed", e);
  } finally {
    // Close the browser whether or not the scrape succeeded
    await browser?.close();
  }
}

if (require.main === module) run();
The code above does the following:
- Import the modules puppeteer-core and dotenv
- Read the secret values into the host and auth variables
- Define the asynchronous run function
- In the try block, connect to the remote browser with puppeteer.connect, passing the endpoint in an object under the key browserWSEndpoint
- Open a new browser page programmatically, which gives access to page elements and lets you fire events
- Set the navigation timeout to 2 minutes with setDefaultNavigationTimeout
- Navigate to the page using the goto function, and afterward, get the page's content with the page.content() method
- Close the browser in the finally block once scraping is done, whether or not it succeeded
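The script prints the page as HTML, even though the target URL serves JSON. The sketch below shows one way to turn that output into a JavaScript object. It assumes the browser wraps a raw JSON response in a `<pre>` element, which you should confirm against your own output. The function name `extractJson` and the sample HTML, including its field names and values, are made up for illustration.

```javascript
// Sketch: pull a JSON payload out of the HTML string returned by page.content().
// Assumption: the browser renders a raw JSON response inside a <pre> element.
function extractJson(html) {
  const match = html.match(/<pre[^>]*>([\s\S]*?)<\/pre>/i);
  // Fall back to the whole string if no <pre> wrapper is found
  const raw = match ? match[1] : html;
  // Undo the HTML escaping applied to text inside the element
  const decoded = raw
    .replace(/&quot;/g, '"')
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
  return JSON.parse(decoded);
}

// Made-up sample standing in for the console output of the script above
const sampleHtml =
  '<html><head></head><body><pre>{"ip":"203.0.113.7","country":"US"}</pre></body></html>';
const data = extractJson(sampleHtml);
console.log(data.ip, data.country);
```

In a live script, reading `document.body.innerText` through `page.evaluate` may be a simpler route to the same text, since it skips the regular expression.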
If you want to expand this project, you can take screenshots of the web pages in PNG format or save them as PDF files.
Check out the documentation to learn more.
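As a starting point for that extension, here is a minimal sketch built on Puppeteer's `page.screenshot` and `page.pdf` methods. The function name `capturePage` and the file names are hypothetical. Whether PDF generation is available when connected to a remote browser is an assumption to verify in the documentation.

```javascript
// Sketch: save a PNG screenshot and a PDF of the page that is currently loaded.
// `page` is expected to be a Puppeteer Page; files are written to the working directory.
async function capturePage(page, baseName = "capture") {
  const pngPath = `${baseName}.png`;
  const pdfPath = `${baseName}.pdf`;
  // fullPage captures the whole scrollable page, not just the viewport
  await page.screenshot({ path: pngPath, fullPage: true });
  // Assumption: the connected browser supports PDF generation
  await page.pdf({ path: pdfPath, format: "A4" });
  return { pngPath, pdfPath };
}
```

Inside the run function, you could call `await capturePage(page, "myip")` right after the `page.goto` call.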
Conclusion
Scraping the web with Bright Data's infrastructure makes the process quicker for your use case because you don't have to write your own unblocking scripts from scratch, as that work is already taken care of for you.
Try it today to explore the benefits of Bright Data over traditional web scraping tools, which are restricted by proxy networks and make it challenging to work with large datasets.
Resources
Scraping Browser documentation
Scrape at scale with Bright Data Scraping Browser