Learn about modern web scraping protection techniques and how to bypass them. Scrape up to three times more pages by combining IP address rotation with shared IP address emulation.
Web scraping is used everywhere. From e-commerce to automotive, industries are collecting valuable data from the web to get ahead of the competition. But as web scraping grows in popularity and accessibility, websites employ ever more sophisticated techniques to block bots.
We compare the effectiveness of plain IP address rotation and shared IP address emulation (aka session multiplexing) at bypassing the protections of Alibaba, Google and Amazon, sites notoriously protective of their data.
Our results show that shared IP address emulation can help you bypass blocking and significantly extend the efficiency of your proxies.
Emulating shared IP address sessions relies on the fact that websites expect many different users behind a single IP address. Requests from mobile phones, for example, are usually routed through only a few IP addresses, and users behind a single corporate firewall may all appear to use the same IP address.
You can trick websites into limiting their blocking by emulating these user sessions. Shared IP address emulation relies on managing the requests you send to websites by using cookies, authentication tokens and browser HTTP signatures that make the requests look like they’re coming from multiple users routed through the same IP address.
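The idea can be sketched in plain Node.js. The minimal session pool below is a hypothetical illustration (the class and method names are not the Apify SDK API): each session carries its own cookie jar and browser signature, and the pool rotates over sessions and retires blocked ones, so requests leaving through one IP look like several independent users.

```javascript
// Hypothetical sketch of shared IP address emulation, not the Apify SDK API.
// Each "session" has its own cookies and user agent, so requests sent
// through a single proxy IP look like several independent users.
class Session {
    constructor(id, userAgent) {
        this.id = id;
        this.userAgent = userAgent; // per-session browser HTTP signature
        this.cookies = {};          // per-session cookie jar
        this.errorCount = 0;
    }
    isUsable() { return this.errorCount < 3; }
    markBad() { this.errorCount += 1; }
}

class SessionPool {
    constructor(userAgents) {
        this.sessions = userAgents.map((ua, i) => new Session(`session_${i}`, ua));
        this.next = 0;
    }
    // Round-robin over sessions that have not been blocked yet.
    getSession() {
        const usable = this.sessions.filter((s) => s.isUsable());
        if (usable.length === 0) throw new Error('All sessions blocked');
        const session = usable[this.next % usable.length];
        this.next += 1;
        return session;
    }
}

const pool = new SessionPool([
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]);

// Two consecutive requests go out with different identities,
// even though they leave through the same IP address.
const a = pool.getSession();
const b = pool.getSession();
console.log(a.id !== b.id); // true
```

A real implementation would attach each session's cookies and user agent to its outgoing requests and call something like `markBad()` whenever the website responds with a block page, which is exactly the bookkeeping the Apify SDK's SessionPool automates.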
In this test, we ran a simple scraper that extracts a web page’s title and search result titles on randomly generated Alibaba, Google and Amazon search pages. Each run was performed using a new, free Apify account, which is allocated 30 random datacenter proxies from a shared pool.
We scraped each site first using only IP rotation and then with a fresh account using shared IP address emulation. Scraping with shared IP address emulation allowed us to scrape between two and three times more pages before being blocked.
The open-source Apify SDK library for Node.js provides a toolbox for web scraping, crawling and web automation tasks. Its built-in SessionPool class enables shared IP address emulation with a few simple configuration parameters and method calls. It is easily pluggable into parts of the Apify ecosystem such as the Apify Proxy and actors but can also be used separately.
The code example below shows how you can create a simple crawler that uses the Apify Proxy and shared IP address emulation with the Apify SDK. The crawler recursively crawls the Apify domain, saving the title of each page it visits.
The example uses CheerioCrawler, Apify’s framework for the parallel crawling of web pages using plain HTTP requests and the cheerio HTML parser. Cheerio is a fast, flexible and lean implementation of core jQuery designed specifically for the server. It parses markup and provides an API for traversing and manipulating the resulting data structure.
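A sketch of such a crawler, assuming the Apify SDK v1 API, might look like this. The start URL, pool size and pseudo-URL pattern are illustrative choices, and running it against Apify Proxy requires an Apify account with proxy access.

```javascript
// Sketch assuming the Apify SDK v1 API; pool size and URL pattern are
// illustrative, and Apify Proxy requires an Apify account.
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://apify.com' });

    // Route all requests through Apify Proxy.
    const proxyConfiguration = await Apify.createProxyConfiguration();

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        proxyConfiguration,
        // Turn on shared IP address emulation: each session keeps its own
        // cookies, so one proxy IP looks like several different users.
        useSessionPool: true,
        persistCookiesPerSession: true,
        sessionPoolOptions: { maxPoolSize: 100 },
        handlePageFunction: async ({ request, $ }) => {
            // Save the title of the page we just visited.
            await Apify.pushData({ url: request.url, title: $('title').text() });
            // Recursively enqueue links that stay on the Apify domain.
            await Apify.utils.enqueueLinks({
                $,
                requestQueue,
                pseudoUrls: ['https://apify.com[.*]'],
                baseUrl: request.loadedUrl,
            });
        },
    });

    await crawler.run();
});
```

With `useSessionPool` and `persistCookiesPerSession` enabled, the SDK transparently assigns a session to each request, retires sessions that get blocked, and reuses each session's cookies across requests.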
The resulting crawler is extremely efficient, since it works with plain HTTP requests and never needs to launch a full browser.
Implementing shared IP address emulation with Apify SDK’s SessionPool is straightforward and can significantly reduce blocking when web scraping. That can lower your proxy costs or simply allow you to scrape more pages.
Would you like to learn more about the Apify SDK? Check out this guide on getting started with Apify.
Feel free to let us know in the comments how this approach works for you!