Asynchronous Web Scraping with Python, aiohttp, and asyncio

proxiesapi ・ Mohan Ganesan ・ Originally published at ・2 min read

One of the features of our rotating proxy service, Proxies API, is the enormous concurrency it offers, starting with the free plan. We wanted our customers' web scrapers to be able to scale from the very first moment. People often ask us what the best way is to make concurrent requests on their side, so we put together some getting-started code to help you understand and begin using async requests in Python.

For I/O-bound work like this, async is often a better option than multi-threading, because writing thread-safe code is challenging.

So here is an example of fetching multiple URLs concurrently using the aiohttp module.

We will try to fetch all the URLs from this list.
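The original list is not reproduced here, so the following is only a placeholder. The name `urls` and the example addresses are assumptions made for illustration; substitute the pages you actually want to fetch.

```python
# Placeholder URLs (assumed for illustration, not the article's original list)
urls = [
    "https://example.com/",
    "https://example.org/",
    "https://example.net/",
]
```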
First, we will initialize everything by loading the modules we need.

We need a function to handle individual fetches.

Notice how the start time is stored per URL in start_time, so the time taken by each fetch can be calculated and printed.
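A rough sketch of such a function might look like the following. It assumes that `session` is an `aiohttp.ClientSession` (anything exposing the same `get` interface would work), that `start_time` is a plain dict keyed by URL, and that `time.monotonic` is an acceptable timer. All three are assumptions for illustration rather than the article's exact code.

```python
import time

# Maps each URL to the moment its fetch started (assumed to be a dict).
start_time = {}


async def fetch(session, url):
    """Fetch one URL, print how long it took, and return the body text.

    `session` is assumed to be an aiohttp.ClientSession.
    """
    start_time[url] = time.monotonic()
    # session.get() returns an async context manager that yields the response.
    async with session.get(url) as response:
        body = await response.text()
    elapsed = time.monotonic() - start_time[url]
    print(f"{url} took {elapsed:.2f} seconds")
    return body
```

Because the function only relies on `session.get`, it can be exercised with a stub session in tests without touching the network.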

The fetch_urls function wraps each fetch in ensure_future, which schedules it on the event loop as a task, and then waits for all of the tasks so that every URL finishes fetching.
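The scheduling pattern can be sketched in isolation as follows. To keep the example runnable offline, `fetch` here is a stand-in that sleeps instead of making an HTTP request, and the `delay` parameter is hypothetical. The use of `asyncio.gather` to wait for the tasks is also an assumption about how the waiting is done.

```python
import asyncio


async def fetch(url, delay=0.1):
    # Stand-in for a real HTTP request: sleeps instead of touching the network.
    await asyncio.sleep(delay)
    return url


async def fetch_urls(urls, delay=0.1):
    # ensure_future wraps each coroutine in a Task and schedules it on the
    # running loop, so every fetch is queued before we wait on any of them.
    tasks = [asyncio.ensure_future(fetch(url, delay)) for url in urls]
    # gather waits for every task and returns results in the order of `tasks`.
    return await asyncio.gather(*tasks)
```

Running `asyncio.run(fetch_urls(["a", "b", "c"]))` takes roughly one delay rather than three, because the stand-in fetches overlap instead of running back to back.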
The fetch_async function sets up the event loop and uses run_until_complete to wait until all the URL fetches have completed before handing control back, so we can print the total time taken.
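A minimal sketch of that wrapper is shown below. The `fetch_urls` coroutine here is only a sleeping stand-in so the example runs offline, and creating the loop with `asyncio.new_event_loop` is an assumption chosen because it behaves the same across recent Python versions.

```python
import asyncio
import time


async def fetch_urls(urls):
    # Stand-in for the real concurrent fetch: sleeps, then echoes the URLs.
    await asyncio.sleep(0.1)
    return list(urls)


def fetch_async(urls):
    started = time.monotonic()
    loop = asyncio.new_event_loop()
    try:
        # Blocks here until the coroutine (and everything it awaits) is done.
        results = loop.run_until_complete(fetch_urls(urls))
    finally:
        loop.close()
    total = time.monotonic() - started
    print(f"Fetched {len(results)} URLs in {total:.2f} seconds")
    return results
```

On Python 3.7 and later, `asyncio.run(fetch_urls(urls))` performs the same create, run, and close sequence in a single call.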
Putting it all together
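One way the complete script could look is sketched here. It is not the article's original listing: the URLs are placeholders, the timing uses `time.monotonic`, and `asyncio.run` stands in for the explicit event loop setup. aiohttp is a third-party package and needs to be installed separately.

```python
import asyncio
import time

import aiohttp  # third-party: pip install aiohttp

# Placeholder URLs (assumed for illustration).
URLS = [
    "https://example.com/",
    "https://example.org/",
    "https://example.net/",
]

# Maps each URL to the moment its fetch started.
start_time = {}


async def fetch(session, url):
    start_time[url] = time.monotonic()
    async with session.get(url) as response:
        body = await response.text()
    print(f"{url} took {time.monotonic() - start_time[url]:.2f} seconds")
    return body


async def fetch_urls(urls):
    # One shared session lets aiohttp reuse connections across requests.
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(session, url)) for url in urls]
        return await asyncio.gather(*tasks)


def fetch_async(urls):
    started = time.monotonic()
    # asyncio.run creates the loop, runs until complete, then closes it.
    pages = asyncio.run(fetch_urls(urls))
    print(f"Total time: {time.monotonic() - started:.2f} seconds")
    return pages
```

Calling `fetch_async(URLS)` at the bottom of the file starts the fetches. Any request that fails raises its exception out of `gather`, so production code would likely want error handling around each fetch.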
When you run it, the time taken for each URL is printed as its fetch completes, followed by the total time taken.
This will scale, but if you want to use it in production and scale to thousands of links, you will find that many websites block your IP quickly. In this scenario, using a rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server, Proxies API, provides a simple API that can solve all IP blocking problems instantly.

With millions of high speed rotating proxies located all over the world
With our automatic IP rotation
With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
With our automatic CAPTCHA solving technology
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed through a simple API call from any programming language.

You don't even have to take the pain of loading Puppeteer, because we render JavaScript behind the scenes. You can just get the data and parse it in any language, such as Node or PHP, or with any framework, such as Scrapy or Nutch. In all these cases, you can just call the URL with render support.
