
Jordan Hansen

Originally published at javascriptwebscrapingguy.com

Jordan Uses ScraperApi

Disclosure: I receive an affiliate commission for everyone who pays for Scraper Api using the link I provide here. Using the promo code SCRAPE187707 also gives a 10% discount.

When I first started web scraping, I remember expecting a lot of difficulty accessing much of the world’s internet. I remember a client asking me to scrape Google results, and when I saw that it was against their terms of service I assumed they would take drastic measures to prevent it. I expected those drastic measures to be things like blocking my IP address completely. I specifically remember wondering how my Android phone would work if my IP was completely blocked from Google services.

Then when I got into scraping Amazon, I expected something similar. I even remember mentioning in a post that I had talked to my wife, mostly jokingly, about what it would be like to be completely banned from Amazon services.

The reality

The reality is, no one wants to block you. You are their customer. It’s messy and bad business to start blocking your customers. They employ a lot of other techniques first, like reCAPTCHAs and timing out requests.

Sometimes those things can stop you from getting the data you want. Google is pretty diligent about these things, so I went and intentionally triggered their rate limiting. See below.

429 status means too many requests.
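If you want to reproduce something like this yourself, here is a rough sketch of the idea: fire a batch of plain axios searches back to back and log the status codes. The terms and batch size here are arbitrary placeholders, and you may need to repeat the batch a few times before the 429s start showing up.

import axios from 'axios';

// Fire a handful of Google searches back to back and log what comes back.
const terms = ['pizza', 'sword', 'burrito', 'chocolate'];

const requests = terms.map(term =>
    axios.get(`https://www.google.com/search?q=${term}`)
        .then(response => console.log(term, response.status))
        .catch(error => console.log(term, error.response ? error.response.status : error.message))
);

Promise.all(requests).then(() => console.log('done'));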

Scraper Api

Here is where Scraper Api comes into play. Scraper Api is a really, really neat tool that handles all the proxying and even CAPTCHAs for you. It’s incredibly easy to use. All you do is take the API key and URL they provide and pass in the URL you want to visit. Then it does the rest. I made the requests below right after the ones above where I was getting 429s. You can see that almost all of the responses were 200s after that.

That’s a lot better
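For reference, here is a minimal sketch of a single request through Scraper Api, using the same URL format as the full example later in this post. The target URL is just a placeholder; swap in whatever page you want to visit.

import axios from 'axios';

// Scraper Api endpoint + your API key + the target URL appended.
const apiKey = process.env.apiKey;
const targetUrl = 'https://google.com/search?q=pizza'; // placeholder target
const scraperApiUrl = `http://api.scraperapi.com?api_key=${apiKey}&url=${targetUrl}`;

axios.get(scraperApiUrl)
    .then(response => console.log('Status:', response.status))
    .catch(error => console.log('Error:', error.response ? error.response.status : error.message));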

Speed test

Isn’t this a neat gif?

Any time you are working with proxies you are going to take a speed hit, and working with Scraper Api was no exception. Here are some of the results with and without Scraper Api:

Without Scraper Api

With Scraper Api

Google without Scraper Api

Google with Scraper Api

As you can see, hitting this website took about four times longer than normal with Scraper Api. Google was a lot worse. I think that’s mostly because Scraper Api rotates proxies for about 60 seconds. So if it hits a problem at all, it’ll keep rotating proxies until it succeeds. That means that for those errors it took a full 60 seconds. Ouch. The good news is, Scraper Api doesn’t bill for any failed attempts.
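Your numbers will vary by site and by run, but here is a rough sketch of how a comparison like this can be reproduced: time the same request made directly and then through Scraper Api.

import axios from 'axios';

// Time a single request; failed attempts still count toward the elapsed
// time, which is where the full 60 second cases come from.
async function timeRequest(label: string, url: string) {
    const start = Date.now();
    try {
        await axios.get(url);
    } catch (error) {
        // ignore the error here; we only care how long the attempt took
    }
    console.log(label, `${Date.now() - start}ms`);
}

const target = 'https://google.com/search?q=pizza'; // placeholder target
const proxied = `http://api.scraperapi.com?api_key=${process.env.apiKey}&url=${target}`;

timeRequest('Without Scraper Api', target)
    .then(() => timeRequest('With Scraper Api', proxied));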

Using Scraper Api in Practice

Because of the significant speed hit, I have to recommend using Scraper Api second, after the conventional methods. Here’s an example of how I think good code could work with Scraper Api:

import axios from 'axios';
import * as cheerio from 'cheerio';
import { Agent } from 'http';

const terms = [
    'pizza',
    'sword',
    'javascript web scraping guy',
    'burrito',
    'chocolate',
    'donuts',
    'why am I hungry',
    'when is breakfast'
];

// Scraper Api limits how many concurrent requests we have at once
const agent = new Agent({
    maxSockets: 25
});

// Kick off a scrape for each term
for (const term of terms) {
    scrape(term, agent);
}

function scrape(term: string, agent: Agent) {
    const url = `https://google.com/search?q=${term}`;
    const scraperApiUrl = `http://api.scraperapi.com?api_key=${process.env.apiKey}&url=`;

    // Try the conventional request first
    axios.get(url).then((response) => {
        const html = response.data.content ? response.data.content : response.data;
        const $ = cheerio.load(html);
        const title = $('title').text();
        console.log('Success: ', response ? response.status : 'no status', title);
    }, (error) => {
        if (error.response) {
            console.log(' Error getting response', term, error.response.status);
            // A 404 means the page doesn't exist, so don't bother retrying
            if (error.response.status === 404) {
                return `404 for ${url}`;
            }
        }
        else if (error.request) {
            console.log(' Error getting request', term, error.message);
        }
        else {
            console.log(' Some other error', error.message);
        }

        // Fall back to Scraper Api, with the agent limiting concurrent requests
        axios({
            url: scraperApiUrl + url,
            method: 'GET',
            httpAgent: agent,
            httpsAgent: agent
        }).then((response) => {
            const html = response.data.content ? response.data.content : response.data;
            const $ = cheerio.load(html);
            const title = $('title').text();
            console.log('Success: ', response ? response.status : 'no status', title);
        }, (error) => {
            console.log(`Failed getting ${url} with Scraper Api`, error);
        });
    });
}

The main takeaway here is that if an error happens, we check to see whether it’s a 404. A 404 is more of an intentional error and means the page actually doesn’t exist. We don’t need Scraper Api for that.

If it’s not a 404, we then try the request again with Scraper Api. Scraper Api does limit how many concurrent requests you are allowed to have going at once. This is where the agent comes in. At the top you can see that I create a new Agent with a limited number of maxSockets. The Node.js documentation states that this “Determines how many concurrent sockets the agent can have open per origin.” So you can manage your concurrent requests here and avoid running into Scraper Api’s concurrency limit.

The end

So, there you have it. Scraper Api is a powerful, extremely easy to use tool in your scraping toolbox. Use the promo code SCRAPE187707 for a 10% discount. They also have 1,000 free API calls!

Looking for business leads?

Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!
