One of the most common questions I see on forums and Reddit about web scraping is "how do I avoid being blocked?" It's a problem I've certainly had to address, and the best solution I've found is puppeteer combined with some of the great tools in puppeteer-extra. It's also important to mention that any web scraping should be done with care. While I feel that scraping anything public is fine, you shouldn't do anything that puts undue burden on the target site. Feel free to take a look at the post I wrote on ethical web scraping.
Officially this is part of the Learn to Web Scrape series, but it isn't targeted at beginners. While I don't think it's very difficult to start using the puppeteer-extra plugins, I'm not going to go into the depth that a complete beginner to programming would need.
To the trials!
We are going to use Zillow as a test target today. I have a simple bit of puppeteer code that visits a random address in Ohio on Zillow. It performs the action five times, waiting 1.5 seconds between each attempt. Check the code:
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: false });
const url = 'https://www.zillow.com/homes/%0913905--ROYAL-BOULEVARD-cleveland-ohio_rb/33601155_zpid/';
for (let i = 0; i < 5; i++) {
    const page = await browser.newPage();
    await page.goto(url);
    // Wait 1.5 seconds between attempts
    await page.waitFor(1500);
    await page.close();
}
await browser.close();
I was blocked on my third attempt. Zillow let me visit the page twice and then:
Ouch. That is some pretty impressive and swift blocking. I tried adding a human-looking user agent.
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
Two visits and then blocked again. Good for Zillow. I honestly applaud websites taking measures to slow down behavior they don't want. The more friction there is, the less likely people are to bother scraping the site.
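If you want to go further than a single hardcoded string, setting the user agent can be generalized into a small rotation helper. This is just a sketch (the list of agents is illustrative), and as the experiment above shows, a user agent alone wasn't enough to get past Zillow's blocking:

```javascript
// A small helper that picks a realistic user agent at random.
// The agents listed here are illustrative examples, not a curated set.
const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
];

function randomUserAgent() {
    return userAgents[Math.floor(Math.random() * userAgents.length)];
}

// Usage inside the loop:
// const page = await browser.newPage();
// await page.setUserAgent(randomUserAgent());
```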
Stealth mode
It's time for the good stuff. berstend has made some really powerful tools that come with puppeteer-extra. There is a large list of plugins, with some cool ones like adblocker, flash, and... stealth.
It's extremely easy to set up. We import the packages with require since there aren't TypeScript definition files yet.
const puppeteerExtra = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
Then we just set up puppeteer from puppeteer-extra.
puppeteerExtra.use(pluginStealth());
const browser = await puppeteerExtra.launch({ headless: false });
// Normal browser from normal puppeteer
// const browser = await puppeteer.launch({ headless: false });
const url = 'https://www.zillow.com/homes/%0913905--ROYAL-BOULEVARD-cleveland-ohio_rb/33601155_zpid/';
for (let i = 0; i < 5; i++) {
    console.log('starting attempt:', i);
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitFor(1500);
    await page.close();
}
await browser.close();
Now, back to Zillow. Out of my five attempts…none were blocked. Let’s try 20.
20 attempts. No reCaptchas. That easy. It's the best package and tool I've seen for avoiding blocks while web scraping with puppeteer, or any package for that matter.
Now, let’s try with 100 attempts. Eventually Zillow catches the stealth plugin and throws a recaptcha.
So, avoiding recaptchas entirely isn’t quite possible. Let’s talk about recaptchas.
reCaptcha land
reCaptchas are tough to deal with but not impossible. berstend comes to our rescue once again with puppeteer-extra-plugin-recaptcha. The thing about reCaptchas, though, is that they can't really be beaten with pure automation. At least, I haven't found a way.
This plugin works by leveraging services that solve reCaptchas. One of those services is 2Captcha (this is an affiliate link, but I use the product myself and really like it: easy to use, very inexpensive, and it works great). You pay to use the service, and the plugin uses this integration to beat reCaptchas. But it's not a program doing the solving; it's actual humans. A little more investigation turned up that 2Captcha hires people to break the reCaptchas.
So what it does (or at least, what I assume it does) is send the reCaptcha to 2Captcha and then someone solves it immediately and sends back the completed token. Here’s the code to handle the reCaptcha:
// Import and register the reCaptcha plugin
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha');

puppeteerExtra.use(
    RecaptchaPlugin({
        provider: { id: '2captcha', token: process.env.captchaToken },
        visualFeedback: true // colorize reCAPTCHAs (violet = detected, green = solved)
    })
);
You'll get your captchaToken from 2Captcha and place it there. In this package I'm using a .env file. I've included a .sample.env file to which you can add a token and just rename to .env.
// Handle the reCaptcha
await page.goto(url);
try {
    // Zillow's block page contains this selector when a reCaptcha is thrown
    await page.waitForSelector('.error-content-block', { timeout: 750 });
    await page.waitFor(5000);
    await (<any>page).solveRecaptchas();
    await Promise.all([
        page.waitForNavigation(),
        page.click('[type="submit"]')
    ]);
    console.log('we found a recaptcha on attempt:', i);
}
catch (e) {
    console.log('no recaptcha found');
}
Bam, that's all there is to it. Now when a reCaptcha pops up, the code detects it and solves it. Easy. I was going to record a gif of it being solved, but after the first solve it must have flagged my IP as good, because it now hardly ever prompts me with reCaptchas. I started another 100-attempt run WITHOUT the stealth plugin, and it didn't prompt to solve a reCaptcha until attempt number 75; then it solved it and continued on.
Pretty awesome, right?
Conclusion
The star of the show is puppeteer-extra. Combine that with its stealth plugin, its reCaptcha plugin, and 2Captcha, and you can avoid, or handle, almost any blocking. Happy scraping!
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!
The post Avoid being blocked with puppeteer appeared first on JavaScript Web Scraping Guy.