DEV Community: Ander Rodriguez

How To Rotate Proxies in Python

Ander Rodriguez — Wed, 08 Jun 2022 14:06:43 +0000

A proxy can hide your IP, but what happens when that proxy gets banned? You would need a new IP. Or you could maintain a list of them and rotate proxies for each request. The final option would be to use Smart Rotating Proxies; more on that later.

For now, we will focus on building our custom proxy rotator. We will start from a list of regular proxies, check them to mark the working ones, and provide simple monitoring to remove the failing ones from the working list. The provided examples use Python, but the idea will work in any language.

Let's dive in!

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install.

pip install aiohttp

Proxy List

You might not have a proxy provider with a list of domain+ports. Do not worry, we'll see how to get one.

There are several lists of free proxies online. For the demo, grab one of those and save its content (just the IP:port pairs) in a text file (rotating_proxies_list.txt). Or use the ones below.

Free proxies aren't reliable, and the ones below probably won't work for you. They are usually short-lived.

167.71.230.124:8080
192.155.107.211:1080
77.238.79.111:8080
167.71.5.83:3128
195.189.123.213:3128
8.210.83.33:80
80.48.119.28:8080
152.0.209.175:8080
187.217.54.84:80
169.57.1.85:8123

Then, we will read that file and create an array with all the proxies: read the file, strip the surrounding whitespace, and split it into lines. Be careful when saving the file, since we won't perform any sanity checks for valid IP:port strings. We'll keep it simple.

proxies_list = open("rotating_proxies_list.txt",
                    "r").read().strip().split("\n")
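
The one-liner above trusts the file completely. If you would rather drop malformed lines, a minimal sketch with the standard ipaddress module could look like this. The helper names is_valid_proxy and parse_proxies are hypothetical, and the sketch assumes IPv4 entries in IP:port form, like the ones listed above.

```python
import ipaddress

def is_valid_proxy(line):
    """Return True if the line looks like IPv4:port (hypothetical helper)."""
    host, sep, port = line.strip().rpartition(":")
    if not sep or not (port.isascii() and port.isdigit()):
        return False
    try:
        ipaddress.IPv4Address(host)  # raises ValueError on a bad address
    except ValueError:
        return False
    return 1 <= int(port) <= 65535

def parse_proxies(text):
    """Keep only the lines that pass the check above."""
    return [line.strip() for line in text.splitlines() if is_valid_proxy(line)]

sample = "167.71.230.124:8080\n\nnot-a-proxy\n1.2.3.4:99999\n"
valid = parse_proxies(sample)
```

Here valid keeps only the first entry, since the other lines are empty, not in IP:port form, or use a port out of range.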

Check Proxies

Let's assume that we want to run the scrapers at scale. The demo is simplified, but the idea would be to store the proxies and their "health state" in a reliable medium like a database. We will use in-memory data structures that disappear after each run, but you get the idea.

First, let's write a simple function to check that the proxy works. For that, call ident.me, which will return the IP. It is a simple page that fits our use case. We will use asyncio and aiohttp, an "Asynchronous HTTP Client/Server" similar to the famous requests. It suits us better since its purpose is to work asynchronously, and it will help us when checking several proxies simultaneously.

For the moment, it takes an item from the proxies list and calls the provided URL. Most of the code is boilerplate that will soon prove useful. There are two possible results:

😊 If everything goes OK, it prints the status code (e.g., 200) and the response's content, which will probably be the proxy's IP.
😔 An error gets printed due to timeout or some other reason. It usually means that the proxy is not available or cannot process the request. Many of these will appear when using free proxies.

import aiohttp
import asyncio

proxies_list = open("rotating_proxies_list.txt", "r").read().strip().split("\n")
timeout = aiohttp.ClientTimeout(total=30)

async def get(url, session, proxy):
    try:
        async with session.get(url, proxy=f"http://{proxy}", timeout=timeout) as response:
            print(response.status, await response.text())
    except Exception as e:
        print(e)

async def check_proxies():
    proxy = proxies_list.pop()
    async with aiohttp.ClientSession() as session:
        await get("http://ident.me/", session, proxy=proxy)

asyncio.run(check_proxies())

We intentionally use HTTP instead of HTTPS because many free proxies don't support SSL.

Add More Checks To Validate the Results

An exception means that the request failed, but there are other results that we should check, such as status codes. We will consider only specific codes valid and mark the rest as errors. The list is not exhaustive; adjust it to your needs. You might think, for example, that 404 "Not Found" isn't valid and should be tested again.

We could also add other checks, like validating that the response contains an IP address.

VALID_STATUSES = [200, 301, 302, 307, 404]

async def get(url, session, proxy):
    try:
        async with session.get(url, proxy=f"http://{proxy}", timeout=timeout) as response:
            if response.status in VALID_STATUSES: # valid proxy
                print(response.status, await response.text())
            else:
                print(response.status)
    except Exception as e:
        print('Exception: ', type(e))
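
As for the extra check mentioned above, validating that the response contains an IP address, here is one possible sketch. The function name body_is_ip is hypothetical, and it assumes the test page returns nothing but the address, as ident.me does in this demo.

```python
import ipaddress

def body_is_ip(body):
    """Return True if the response body is a single IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(body.strip())
    except ValueError:
        return False
    return True

ok = body_is_ip("152.0.209.175\n")
blocked = body_is_ip("<html>Access denied</html>")
```

Inside get, a proxy could then count as valid only when both the status code and this check pass.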

Iterate Over All the Proxies

Great! We now need to run the checks for each proxy in the array. We will loop over the proxy list calling get just as before. But instead of doing it sequentially, we will use asyncio.gather to launch all the requests and wait for them to finish. Async makes the code more complicated, but it speeds up web scraping.

The list is capped at 10 items as a safety measure, to avoid sending hundreds of requests unintentionally.

async def check_proxies():
    proxies = proxies_list[0:10] # limited to 10 to avoid too many requests
    async with aiohttp.ClientSession() as session:
        tasks = [
            get("http://ident.me/", session, proxy=proxy)
            for proxy in proxies
        ]
        await asyncio.gather(*tasks, return_exceptions=True)
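
To see what return_exceptions=True changes, here is a small sketch that needs no network. fake_check is a hypothetical stand-in for get, and the addresses are made up: any proxy ending in ":0" fails on purpose.

```python
import asyncio

async def fake_check(proxy):
    """Hypothetical stand-in for a proxy check; fails for port 0."""
    await asyncio.sleep(0.01)
    if proxy.endswith(":0"):
        raise ConnectionError(f"cannot reach {proxy}")
    return proxy

async def check_all(proxies):
    # With return_exceptions=True, a failing task does not abort the batch:
    # its exception object is placed in the results list instead.
    tasks = [fake_check(proxy) for proxy in proxies]
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(check_all(["10.0.0.1:8080", "10.0.0.2:0"]))
failed = [r for r in results if isinstance(r, Exception)]
```

Without that flag, the first exception would propagate out of gather and we would lose the other results.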

We should also limit the number of concurrent requests. We'll do it using Semaphore, an object that will acquire and release a lock. It will maintain an internal counter that allows only so many calls (10 in this case), thus creating a maximum concurrency.

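We need to change how we call check_proxies too. On Python versions before 3.10, a Semaphore created at module level binds to the default event loop, whereas asyncio.run creates a new loop, so we run the coroutine on the default loop instead.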

sem = asyncio.Semaphore(10)
# ...
async def get(url, session, proxy):
    async with sem:
        try:
            # ...

loop = asyncio.get_event_loop()
loop.run_until_complete(check_proxies())

Separate Working Proxies From the Failed Ones

Examining an output log is far from ideal, isn't it? We should keep an internal state for the proxy list. We will separate them into three groups:

unchecked: unknown state, to be checked.
working: the last call using this proxy was successful.
not working: the last request failed.

It is easier to add or remove items from sets than arrays, and they come with the advantage of avoiding duplicates. We can move proxies between lists without worrying about having the same one twice. If it's present, it just won't be added. That will simplify our code: remove an item from a set and add it to another. To achieve that, we need to modify the proxy storage slightly.

Three sets will exist, one for each group seen above. The initial one, unchecked, will contain the proxies from the file. A set can be initialized from an array, making it easy for us to create it.

proxies_list = open("rotating_proxies_list.txt", "r").read().strip().split("\n")
unchecked = set(proxies_list[0:10]) # limited to 10 to avoid too many requests
# unchecked = set(proxies_list)
working = set()
not_working = set()

# ...
async def check_proxies():
    async with aiohttp.ClientSession() as session:
        tasks = [
            get("http://ident.me/", session, proxy=proxy)
            for proxy in unchecked # use the new set for the loop
        ]
        await asyncio.gather(*tasks, return_exceptions=True)
#...

Now, write helper functions to move proxies between states, one helper for each state. Each one adds the proxy to a set and removes it, if present, from the other two. Here is where sets come in handy, since we don't need to check whether the proxy is present or loop over arrays. Calling discard removes the item if it is present and does nothing otherwise, so no exception is raised.

For example, we will call set_working when a request is successful. And that function will remove the proxy from the unchecked or not working sets while adding it to the working set.

def reset_proxy(proxy):
    unchecked.add(proxy)
    working.discard(proxy)
    not_working.discard(proxy)

def set_working(proxy):
    unchecked.discard(proxy)
    working.add(proxy)
    not_working.discard(proxy)

def set_not_working(proxy):
    unchecked.discard(proxy)
    working.discard(proxy)
    not_working.add(proxy)
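
The three helpers share one pattern. As an illustration only, the sketch below folds it into a single hypothetical move_to function and walks one made-up proxy through the states, to show that repeated moves neither raise nor create duplicates.

```python
unchecked = {"10.0.0.1:8080"}  # made-up address for the demo
working = set()
not_working = set()

def move_to(target, proxy):
    """Remove the proxy from every group, then add it to the target one."""
    for group in (unchecked, working, not_working):
        group.discard(proxy)  # no KeyError when the proxy is absent
    target.add(proxy)

move_to(working, "10.0.0.1:8080")
move_to(working, "10.0.0.1:8080")      # adding twice keeps a single entry
move_to(not_working, "10.0.0.1:8080")  # the proxy leaves the working set
```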

We are missing the crucial part! We need to edit get to call these functions after each request. set_working for the successful ones and set_not_working for the rest.

async def get(url, session, proxy):
    async with sem:
        try:
            async with session.get(url, proxy=f"http://{proxy}", timeout=timeout) as response:
                if response.status in VALID_STATUSES: # valid proxy
                    set_working(proxy)
                else:
                    set_not_working(proxy)
        except Exception as e:
            set_not_working(proxy)

For the moment, add some traces at the end of the script to see if it's working well. The unchecked set should be empty since we ran all the items. And those items will populate the other two sets. Hopefully working is not empty 😅 - it might happen with free proxies.

#...
loop = asyncio.get_event_loop()
loop.run_until_complete(check_proxies())

print('unchecked ->', unchecked)
# unchecked -> set()
print('working ->', working)
# working -> {'152.0.209.175:8080', ...}
print('not_working ->', not_working)
# not_working -> {'167.71.5.83:3128', ...}

Using the Working Proxies

That was a straightforward way to check proxies but not truly useful yet. We now need a way to get the working proxies and use them for the real reason: web scraping actual content. We will create a function that will select a random proxy.

We included both working and unchecked proxies in our example; feel free to use only the working ones if it fits your needs. We will see later why the unchecked ones are present too.

random.choice needs a sequence and doesn't work with sets, so we'll convert the result into a tuple.

import random

def get_random_proxy():
    # create a tuple from unchecked and working sets
    available_proxies = tuple(unchecked.union(working))
    if not available_proxies:
        raise Exception("no proxies available")
    return random.choice(available_proxies)
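
A possible variation, sketched under assumptions: prefer the working proxies and fall back to the unchecked ones only when none has been confirmed yet. pick_proxy is a hypothetical name, and it takes the sets as parameters so the snippet stays self-contained.

```python
import random

def pick_proxy(working, unchecked):
    """Prefer known-good proxies; fall back to the unchecked ones."""
    pool = working or unchecked  # an empty set is falsy
    if not pool:
        raise LookupError("no proxies available")
    return random.choice(tuple(pool))  # random.choice needs a sequence

from_working = pick_proxy({"10.0.0.1:8080"}, {"10.0.0.2:8080"})
from_unchecked = pick_proxy(set(), {"10.0.0.2:8080"})
```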

Next, we can edit the get function to use a random proxy if none is present. The proxy parameter is now optional. We will use that param to check the initial proxies, as we were doing before. But after that, we can forget about the proxy list and call get without it. A random one will be used and added to the not_working set in case of failure.

Since we now want to get actual content, we need to return the response or raise the exception. With aiohttp, unlike requests, the response's content must be awaited. Here is the final version.

async def get(url, session, proxy=None):
    if not proxy:
        proxy = get_random_proxy()

    async with sem:
        try:
            async with session.get(url, proxy=f"http://{proxy}", timeout=timeout) as response:
                if response.status in VALID_STATUSES:
                    set_working(proxy)
                else:
                    set_not_working(proxy)

                await response.text() # content needs to be "awaited"
                return response # return response
        except Exception as e:
            set_not_working(proxy)
            raise e # raise exception

Below the script, include the code for the content you want to scrape. For the demo, we will merely call the same test URL once again.

The idea is to start from here and build a real-world scraper on this backbone. And to scale it, store the items in persistent storage, such as a database (e.g., Redis).

#....
loop = asyncio.get_event_loop()
loop.run_until_complete(check_proxies())

# real scraping part comes here
async def main():
    async with aiohttp.ClientSession() as session:
        result = await get("http://ident.me/", session)
        print(result.ok) # True
        print(result.status) # 200
        print(await result.text()) # 152.0.209.175

asyncio.run(main())
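
Since get re-raises on failure, the caller decides what to do next. One option, sketched here under assumptions, is to retry a few times so that each attempt can pick another random proxy. get_with_retries and flaky_fetch are hypothetical names; flaky_fetch only simulates a fetcher that fails twice before succeeding.

```python
import asyncio

async def get_with_retries(fetch, url, attempts=3):
    """Await fetch(url) up to `attempts` times; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: propagate the last error

calls = []

async def flaky_fetch(url):
    """Simulated fetcher: fails on the first two calls."""
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("proxy failed")
    return f"content of {url}"

result = asyncio.run(get_with_retries(flaky_fetch, "http://ident.me/"))
```

In the real script, fetch would be a small wrapper that calls get with the shared session.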

What happens with false negatives or one-time errors? Once we send a proxy to the not_working set, it will remain there forever. There's no way back.

Re-Checking Not Working Proxies

We should re-check the failed proxies from time to time. There are many reasons why a failure may be temporary: a networking issue, a bug, or a problem that the proxy provider has since fixed.

In any case, Python allows us to set Timers, "an action that should be run only after a certain amount of time has passed". There are different ways to achieve the same end, and this is simple enough to run it using three lines.

Remember the reset_proxy function? We didn't use it at all until now. We will set a Timer to run that function for every proxy marked as not working. Twenty seconds is a small number for a real-world case but enough for our test. We exclude a failing proxy and move it back to unchecked after some time.

And this is the reason to use both working and unchecked sets in get_random_proxy. Modify that function to use only working proxies for a more robust use case. And then, you can run check_proxies periodically, which will loop over the unchecked elements - in this case, failed proxies that remained some time in the sin bin.

from threading import Timer

def set_not_working(proxy):
    unchecked.discard(proxy)
    working.discard(proxy)
    not_working.add(proxy)

    # move to unchecked after a certain time (20s in the example)
    Timer(20.0, reset_proxy, [proxy]).start()

There is a final option for even more robust systems, but we'll leave the implementation up to you. Store analytics and usage for each proxy, for example, the number of times it failed and when was the last one. Using that info, adjust the time to re-check - longer times for proxies that failed several times. Or even set some alerts if the number of working proxies goes below a threshold.
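
As a starting point for that idea, here is a minimal sketch of a re-check delay that doubles with each failure. The names failure_counts and recheck_delay, the 20-second base, and the one-hour cap are all assumptions for illustration.

```python
from collections import Counter

failure_counts = Counter()  # proxy -> number of failures seen so far

def recheck_delay(proxy, base=20.0, cap=3600.0):
    """Register one more failure and return the seconds to wait before re-checking."""
    failure_counts[proxy] += 1
    return min(cap, base * 2 ** (failure_counts[proxy] - 1))

# three consecutive failures of the same (made-up) proxy
delays = [recheck_delay("10.0.0.1:8080") for _ in range(3)]
```

The returned value could replace the fixed 20.0 passed to Timer in set_not_working.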

Conclusion

Building a plain proxy rotator might seem doable for small scraping scripts, but it can grow painful. But, hey, you did it!! 👏

These are the steps we followed:

Store proxy list as plain text
Import from the file as an array
Check each of them
Separate the working ones
Check for failures while scraping and remove them from the working list
Re-check not working proxies from time to time

As a note of caution, do not rotate IPs when scraping behind a login or when the site relies on any other kind of session or cookies.

If you don't want to worry about rotating proxies manually, you can always use ZenRows, a Web Scraping API that includes Smart Rotating Proxies. It works as a regular proxy - with a single URL - but provides different IPs for each request.

Did you find the content helpful? Please, spread the word and share it. 👈

Originally published at https://www.zenrows.com

Speed Up Web Scraping with Concurrency in Python

Ander Rodriguez — Tue, 17 May 2022 15:09:59 +0000

Scraping websites for data is a typical use case for developers. Whether it's a side project or you're building a startup, there are many reasons to scrape the web.

For example, if you want to start a price comparison website, you'll need to scrape prices from various e-commerce sites. Maybe you want to build an AI that could identify products and look up their price on Amazon. The possibilities are endless.

But have you ever noticed how slow it is to get all the pages? Would you scrape all the products one after the other? There must be a better solution, right? Right?!

Scraping websites can be time-consuming because you have to deal with waiting for responses from the server and rate-limiting. That's why we will show you how to speed up your web scraping projects by using concurrency in Python.

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install.

pip install requests beautifulsoup4 aiohttp numpy

If you know the basics behind concurrency, skip the theory part and jump directly into the action.

Concurrency

Concurrency is a term that deals with the ability to run multiple computing tasks simultaneously.

When you make requests to websites sequentially, you send out one request at a time, waiting for it to return, and then sending out the following one.

However, you can send out many requests at once with concurrency, and work on them all as they return. The speed increases from this method are incredible. Compared to sequential requests, concurrent ones will be much faster regardless of whether they are running in parallel (multiple CPUs) or not - more on this later.

To understand the benefits of concurrency, we need to understand the difference between processing tasks sequentially and concurrently. For example, let's say we have five tasks that take 10 seconds each to complete.

When processing them sequentially, the time it takes to complete all five is 50 seconds. However, it takes only 10 seconds for all five tasks to complete when processing them concurrently.

In addition to increasing speed, concurrency allows us to do more work in less time by distributing our web scraping workload among several processes.

There are several ways to parallelize requests, such as multiprocessing and asyncio. From a web scraping perspective, we can use these libraries to parallelize requests to different websites or other pages on the same website. In this article, we will focus on asyncio, a Python module providing infrastructure for writing single-threaded concurrent code using coroutines.

Since concurrency implies more convoluted systems and code, consider if the pros outweigh the cons for your use case.

Benefits of Concurrency

More work done in less time.
Idle network time invested in other requests.

Dangers of Concurrency

Harder to develop and debug.
Race conditions.
The need to check and use thread-safe functions.
Block probabilities grow if not handled carefully.
Concurrency comes with system overhead, so set a reasonable concurrency level.
Involuntary DDoS if too many requests against a small site.

Why asyncio?

To decide what technology to use, we must understand the difference between asyncio and multiprocessing, and also between I/O-bound and CPU-bound programs.

asyncio "is a library to write concurrent code using the async/await syntax". It runs on a single processor.

multiprocessing "is a package that supports spawning processes using an API [...] allowing the programmer to fully leverage multiple processors on a given machine". Each process will start its own Python interpreter in a different CPU.

I/O-bound means that the program will run slower due to input/output operations. In our case, mostly network requests.

CPU-bound means that the program will run slower due to central processor use - for example, math calculations.

Why does this affect the library we will use for concurrency? Because a big part of the cost for concurrency is creating and maintaining threads/processes. For CPU-bound problems, having many processes in different CPUs will pay off. But that might not be the case for I/O-bound scenarios.

Since scraping is mostly I/O-bound, we picked asyncio. But in case of doubt (or just for fun), you can replicate the idea using multiprocessing and compare the results.

Sequential Version

We'll start by scraping scrapeme.live as an example, a fake Pokémon e-commerce prepared for testing.

First, we will start with the sequential version of the scraper. Several snippets are part of all cases, so those will remain the same.

Visiting the site, we see that the shop has 48 pages. Since it is a testing environment, that won't change anytime soon. Our first constants will be the base URL and a range for the pages.

base_url = "https://scrapeme.live/shop/page"
pages = range(1, 49) # max page (48) + 1

Now, extract the basics from a product. For that, use requests.get to get the HTML and then BeautifulSoup to parse it. We will loop over each product and get some basic info from them. All the selectors come from a manual review of the content (using DevTools), but we won't go into detail here for brevity.

import requests 
from bs4 import BeautifulSoup 

def extract_details(page): 
    # concatenate page number to base URL 
    response = requests.get(f"{base_url}/{page}/") 
    soup = BeautifulSoup(response.text, "html.parser") 

    pokemon_list = [] 
    for pokemon in soup.select(".product"): # loop each product 
        pokemon_list.append({ 
            "id": pokemon.find(class_="add_to_cart_button").get("data-product_id"), 
            "name": pokemon.find("h2").text.strip(), 
            "price": pokemon.find(class_="price").text.strip(), 
            "url": pokemon.find(class_="woocommerce-loop-product__link").get("href"), 
        }) 
    return pokemon_list

The extract_details function will take a page number and concatenate that to create a URL with the base seen earlier. After getting the content and creating an array of products, return them. That means the returned values will be a list of dictionaries. It is an essential detail for later.

We need to run the function above for each page, get all the results, and store them.

import csv 

# modified to avoid running all the pages unintentionally 
pages = range(1, 3) 

def store_results(list_of_lists): 
    pokemon_list = sum(list_of_lists, []) # flatten lists 

    with open("pokemon.csv", "w", newline="") as pokemon_file: # newline="" as the csv docs recommend
        # get dictionary keys for the CSV header 
        fieldnames = pokemon_list[0].keys() 
        file_writer = csv.DictWriter(pokemon_file, fieldnames=fieldnames) 
        file_writer.writeheader() 
        file_writer.writerows(pokemon_list) 

list_of_lists = [ 
    extract_details(page) 
    for page in pages 
] 
store_results(list_of_lists)

Running the code above will get two product pages, extract products (32 total), and store them in a CSV file called pokemon.csv. The store_results function does not affect the scraping in sequential or concurrent mode. You can skip it.

Since the results are lists, we must flatten them to allow writerows to do its job. That's why we named the variable list_of_lists (even if it's a bit weird), only to remind everyone that it's not flat.
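
To make the flattening step concrete without hitting the site, here is a small sketch with placeholder data. It uses itertools.chain.from_iterable, an alternative to sum(list_of_lists, []) that gives the same flat list, and writes the CSV to an in-memory buffer instead of a file. The product values are invented for the demo.

```python
import csv
import io
from itertools import chain

# placeholder data: one inner list per scraped page
list_of_lists = [
    [{"id": "1", "name": "A"}],
    [{"id": "2", "name": "B"}, {"id": "3", "name": "C"}],
]

# flatten the list of lists into a single list of dictionaries
flat = list(chain.from_iterable(list_of_lists))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=flat[0].keys())
writer.writeheader()
writer.writerows(flat)  # writerows expects a flat iterable of dicts
csv_text = buffer.getvalue()
```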

If you were to run the script for every page (48 total), it would generate a CSV with 755 products and take around 30 seconds.

time python script.py 

real 0m31,806s 
user 0m1,936s 
sys 0m0,073s

Introducing asyncio

We know we can do better. If we perform all the requests at the same time, it should take much less, right? Maybe as long as the slowest request?

Concurrency should indeed run faster, but it also involves some overhead. So it is not a linear mathematical improvement. But improve we will.

For that, we will use the mentioned asyncio. It allows us to run several tasks on the same thread in an event loop (like JavaScript does). It will run a function and switch the context to a different one when allowed. In our case, HTTP requests allow that switch.

We will start with an example that sleeps for a second, so the script should take about a second to run. Notice that we cannot call main directly. We need to let asyncio know that it's an async function that needs executing.

import asyncio 

async def main(): 
    print("Hello ...") 
    await asyncio.sleep(1) 
    print("... World!") 

asyncio.run(main())

time python script.py 
Hello ... 
... World! 

real 0m1,054s 
user 0m0,045s 
sys 0m0,008s

Simple code in parallel

Next, we will expand an example case to run a hundred functions. Each of them will sleep for a second and print a text. It would take around one hundred seconds if we were to run them sequentially. With asyncio, it will take just one!

That's the power behind concurrency. As said earlier, for pure I/O-bound tasks, it will perform much faster. Sleeping is not really I/O, but it works for the example.

We need to create a helper function that will sleep for a second and print a message. Then, we edit main to call that function a hundred times and store each call in a tasks list. The last and crucial part is to execute and wait for all the tasks to finish. That's what asyncio.gather does.

import asyncio 

async def demo_function(i): 
    await asyncio.sleep(1) 
    print(f"Hello {i}") 

async def main(): 
    tasks = [ 
        demo_function(i) 
        for i in range(0, 100) 
    ] 
    await asyncio.gather(*tasks) 

asyncio.run(main())

As expected, a hundred messages and one second to execute. Perfect!

time python script.py 
Hello 0 
... 
Hello 99 

real 0m1,065s 
user 0m0,063s 
sys 0m0,000s

Scraping with asyncio

We need to apply that knowledge to scraping. The approach to follow will be to request concurrently and return product lists. Once all requests finish, store them. It might be better to save data after each request or in batches to avoid data losses for real-world cases.

Our first attempt won't have a concurrency limit, so be careful when using it. In the case of running it with thousands of URLs... well, it would perform all those requests almost at the same time. Which could cause a tremendous load on the server and probably fry your computer.

requests does not support async out-of-the-box, so we will use aiohttp to avoid complications. requests can do the job, and there is no substantial performance difference. But the code is more readable using aiohttp.

import asyncio 
import aiohttp 
from bs4 import BeautifulSoup 

async def extract_details(page, session): 
    # similar to requests.get but with a different syntax 
    async with session.get(f"{base_url}/{page}/") as response: 

        # notice that we must await the .text() function 
        soup = BeautifulSoup(await response.text(), "html.parser") 

        # [...] same as before 
        return pokemon_list 

async def main(): 
    # create an aiohttp session and pass it to each function execution 
    async with aiohttp.ClientSession() as session: 
        tasks = [ 
            extract_details(page, session) 
            for page in pages 
        ] 
        list_of_lists = await asyncio.gather(*tasks) 
        store_results(list_of_lists) 

asyncio.run(main())

The CSV file should have every product (755) just as before. Since we perform all the page calls at the same time, the responses will not arrive in order. If we were to write the results to the file inside extract_details, they might end up unordered. But asyncio.gather returns the results in the same order as the tasks we passed in, and we process them only after all tasks finish, so the order will not be a problem.
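
The ordering behavior can be checked with a small sketch that needs no network. fake_page is a hypothetical stand-in for extract_details: the first page is the slowest, so it finishes last, yet its result still comes first.

```python
import asyncio

finished = []  # records the order in which tasks actually complete

async def fake_page(page, delay):
    """Hypothetical stand-in for a page request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    finished.append(page)
    return page

async def main():
    # page 1 is the slowest, page 3 the fastest
    return await asyncio.gather(
        fake_page(1, 0.15),
        fake_page(2, 0.10),
        fake_page(3, 0.05),
    )

results = asyncio.run(main())
```

results follows the order of the arguments, while finished reflects the completion order.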

time python script.py 

real 0m11,442s 
user 0m1,332s 
sys 0m0,060s

We did it! 3x faster is nice, but... shouldn't it be 40x? It's not that simple. Many things can affect the performance (Network, CPU, RAM, and so on).

And in this demo page, we've noticed that response time slows down when we perform several calls. It might be by design. Some servers/providers can limit the number of concurrent requests to avoid too much traffic from the same IP. It is not a block but more of a queue. You will get served, but after waiting a bit.

To see real speed-up, you can test against a delay page. It is another testing page that will wait for 2 seconds and then return a response.

base_url = "https://httpbin.org/delay/2" 
#... 

async def extract_details(page, session): 
    async with session.get(base_url) as response: 
        #...

We removed all the extracting and storing logic and just call the delay URL 48 times. It runs in under 3 seconds.

time python script.py 

real 0m2,865s 
user 0m0,245s 
sys 0m0,031s

Limiting Concurrency with Semaphore

As said earlier, we should limit the number of concurrent requests, especially against a single domain.

asyncio comes with Semaphore, an object that will acquire and release a lock. Its inner functionality will block some of the calls until the lock is acquired, thus creating a maximum concurrency.

We need to create the semaphore with the maximum we want. Then, inside the extracting function, we wait with async with sem until a slot is available.

max_concurrency = 3 
sem = asyncio.Semaphore(max_concurrency) 

async def extract_details(page, session): 
    async with sem: # semaphore limits num of simultaneous downloads 
        async with session.get(f"{base_url}/{page}/") as response: 
            # ... 

async def main(): 
        # ... 

loop = asyncio.get_event_loop() 
loop.run_until_complete(main())

It gets the job done, and it is relatively easy to implement! Here is the output with max concurrency set to 3.

time python script.py 

real 0m13,062s 
user 0m1,455s 
sys 0m0,047s

It shows that the version with unlimited concurrency is not operating at its full speed 🤦. If we increase the limit to 10, the total time is similar to the unbounded script.
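
To verify that the semaphore really caps concurrency, here is a sketch that replaces the HTTP call with a short sleep and records how many workers are active at once. worker, run_demo, and the state dictionary are hypothetical names used only for this demo.

```python
import asyncio

async def worker(sem, state):
    async with sem:  # waits here while `limit` workers are already inside
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stands in for the HTTP request
        state["active"] -= 1

async def run_demo(limit=3, jobs=10):
    sem = asyncio.Semaphore(limit)  # created inside the running loop
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(jobs)))
    return state["peak"]

peak = asyncio.run(run_demo())
```

With ten jobs and a limit of 3, peak never goes above 3. Creating the semaphore inside the coroutine also lets asyncio.run work on any Python version.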

Limiting concurrency with TCPConnector

aiohttp offers an alternative solution that allows further configuration. We can create the client session passing in a custom TCPConnector.

We can build it by using two parameters that suit our needs:

limit - "total number of simultaneous connections".
limit_per_host - "limit simultaneous connections to the same endpoint" (same host, port, and is_ssl).

max_concurrency = 10 
max_concurrency_per_host = 3 

async def main(): 
    connector = aiohttp.TCPConnector(limit=max_concurrency, limit_per_host=max_concurrency_per_host) 
    async with aiohttp.ClientSession(connector=connector) as session: 
        # ... 

asyncio.run(main())

Also easy to implement and maintain! Here is the output with max concurrency set to 3 per host.

time python script.py 

real 0m16,188s 
user 0m1,311s 
sys 0m0,065s

The advantage over Semaphore is the option to limit both the total number of concurrent calls and the number of requests per domain. We could use the same session to scrape different sites, and each one of those would have its own limit.

The downside is that it looks a bit slower. Run some tests with more pages and actual data for a real-case scenario.

multiprocessing

Scraping is I/O-bound like we saw earlier. But what if we needed to mix it with some CPU-intensive computations? To test that case, we'll use a function that will count_a_lot (to one hundred million) after each scraped page. It is a simple (and silly) way to force a CPU to be busy for some time.

def count_a_lot(): 
    count_to = 100_000_000 
    counter = 0 
    while counter < count_to: 
        counter = counter + 1 

async def extract_details(page, session): 
    async with session.get(f"{base_url}/{page}/") as response: 
        # ... 
        count_a_lot() 
        return pokemon_list

For the asyncio version, just run it as before. It might take a long time ⏳.

time python script.py 

real 2m37,827s 
user 2m35,586s 
sys 0m0,244s
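
The reason for that long run can be reproduced in miniature. In the sketch below (hypothetical names, tiny durations), the awaited sleeps overlap, but the CPU-bound part blocks the event loop, so the four busy loops run one after another.

```python
import asyncio
import time

def busy(seconds):
    """Keep the CPU busy for `seconds`; a small stand-in for count_a_lot."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass

async def task_with_cpu_work():
    await asyncio.sleep(0.05)  # I/O-like wait: overlaps with the other tasks
    busy(0.05)                 # CPU work: nothing else runs meanwhile

async def main():
    start = time.perf_counter()
    await asyncio.gather(*(task_with_cpu_work() for _ in range(4)))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
```

The total cannot drop below the sum of the four busy loops (0.2 seconds here), no matter how many tasks we gather.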

Now, brace for the hard part.

Adding multiprocessing is a bit harder. We need to create a ProcessPoolExecutor, which "uses a pool of processes to execute calls asynchronously". It will handle the creation and control of each process in a different CPU.

But it won't distribute the load. For that, we will use NumPy's array_split, which will slice the pages range into equal chunks according to the number of CPUs.

The rest of the main function is similar to the asyncio version but changes some syntax to match the multiprocessing style.

The essential difference is that we cannot call extract_details directly. We could, but we'll try to obtain the maximum power by mixing multiprocessing with asyncio.

import concurrent.futures # needed for concurrent.futures.wait below
from concurrent.futures import ProcessPoolExecutor 
from multiprocessing import cpu_count 
import numpy as np 

num_cores = cpu_count() # number of CPU cores 

def main(): 
    executor = ProcessPoolExecutor(max_workers=num_cores) 
    tasks = [ 
        executor.submit(asyncio_wrapper, pages_for_task) 
        for pages_for_task in np.array_split(pages, num_cores) 
    ] 
    doneTasks, _ = concurrent.futures.wait(tasks) 

    results = [ 
        item.result() 
        for item in doneTasks 
    ] 
    store_results(results) 

if __name__ == "__main__": # guard needed when worker processes are spawned (e.g., Windows, macOS)
    main()
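
To see how the pages range gets divided, here is a pure-Python sketch that mimics what np.array_split does for a one-dimensional input. split_evenly is a hypothetical helper, not part of NumPy: when the items don't divide evenly, the first chunks get one extra item each.

```python
def split_evenly(items, parts):
    """Split items into `parts` chunks whose sizes differ by at most one."""
    items = list(items)
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

chunks = split_evenly(range(1, 49), 8)                       # 48 pages, 8 processes
uneven = [len(chunk) for chunk in split_evenly(range(10), 3)]
```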

Long story short, each CPU process will have a few pages to scrape. There are 48 pages, and assuming your machine has 8 CPUs, each process will request six pages (6 * 8 = 48).

And those six pages will run concurrently! After that, the calculations will have to wait since they are CPU-intensive. But we have many CPUs, so they should run faster than the pure asyncio version.

async def extract_details_task(pages_for_task): 
    async with aiohttp.ClientSession() as session: 
        tasks = [ 
            extract_details(page, session) 
            for page in pages_for_task 
        ] 
        list_of_lists = await asyncio.gather(*tasks) 
        return sum(list_of_lists, []) 


def asyncio_wrapper(pages_for_task): 
    return asyncio.run(extract_details_task(pages_for_task))

This ☝️ is where the magic happens. Each CPU process will start an asyncio event loop with a subset of the pages (e.g., from 1 to 6 for the first one).

And then, each one of those will call several URLs, using the already known extract_details function.
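The gather-and-flatten step can be tried without any network access. In this sketch, fake_extract_details is a hypothetical stand-in for extract_details that sleeps briefly instead of requesting a page, and the returned names are made up.

```python
import asyncio


async def fake_extract_details(page):
    """Hypothetical stand-in for extract_details: waits instead of calling a URL."""
    await asyncio.sleep(0.01)
    return [f"pokemon-{page}-a", f"pokemon-{page}-b"]


async def extract_details_task(pages_for_task):
    tasks = [fake_extract_details(page) for page in pages_for_task]
    # gather runs the coroutines concurrently and keeps the input order
    list_of_lists = await asyncio.gather(*tasks)
    # flatten the list of lists into a single list
    return sum(list_of_lists, [])


def asyncio_wrapper(pages_for_task):
    # plain function, so a process pool can call it
    return asyncio.run(extract_details_task(pages_for_task))


flat = asyncio_wrapper(range(1, 4))
print(flat)
```

The wrapper is a regular function because the executor needs something it can call synchronously in each process.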

Take a moment to assimilate that. The whole process goes like this:

Create the executor.
Split the pages.
Start asyncio for each process.
Create an aiohttp session and create the tasks of a subset of pages.
Extract data for each page.
Consolidate and store the results.

And here are the execution times. We haven't mentioned it, but the user time here plays a notable role. For the script running only asyncio:

time python script.py 

real 2m37,827s 
user 2m35,586s 
sys 0m0,244s

The version with asyncio and multiple processes:

time python script.py 

real 0m38,048s 
user 3m3,147s 
sys 0m0,532s

Did you spot the difference? The first one took more than two and a half minutes, and the second one about 38 seconds. But in total CPU time (user time), the second one took over three minutes! The extra half a minute is overhead from creating and coordinating several processes.

That shows that the parallel processing "wastes" more time (in total) but finishes before. Then it is up to you to decide which method to choose. Take also into account that it is more complicated to develop and debug 😅.
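The same distinction between elapsed time and CPU time can be observed from inside Python with the standard time module. In this sketch, busy_count is a scaled-down stand-in for count_a_lot, and the sleep plays the role of waiting on the network. Note that time.process_time only covers the current process, so unlike the shell's user time it would not include work done by child processes.

```python
import time


def busy_count(count_to):
    """Scaled-down stand-in for count_a_lot: keeps the CPU busy."""
    counter = 0
    while counter < count_to:
        counter += 1
    return counter


wall_start = time.perf_counter()  # wall-clock time, like "real"
cpu_start = time.process_time()   # CPU time of this process only

busy_count(200_000)  # CPU-bound: counts toward both clocks
time.sleep(0.05)     # waiting: counts toward wall-clock time only

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall: {wall:.3f}s, cpu: {cpu:.3f}s")
```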

Conclusion

We've seen that asyncio might be enough for scraping since most of the running time goes to networking, which is I/O-bound and works well with concurrent processing on a single core.

That situation changes if the gathered data requires some CPU-intensive work. We've seen a silly example with counting, but you get the point.

For most cases, asyncio with aiohttp - better suited than requests for async work - gets the job done. Add a custom connector to limit the number of requests per domain and total concurrent ones. With those three pieces, you can start building a scraper that can scale.
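In aiohttp, those caps are typically set on the connector (TCPConnector accepts limit and limit_per_host). To illustrate the effect without a network, here is a sketch that uses asyncio.Semaphore as a stand-in for the connector limit. The fake_request coroutine and the numbers are assumptions for the demo.

```python
import asyncio


async def run_limited(pages, max_concurrent):
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_request(page):
        """Hypothetical request: sleeps instead of calling a URL."""
        nonlocal in_flight, peak
        async with semaphore:  # at most max_concurrent bodies run at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)
            in_flight -= 1
            return page

    results = await asyncio.gather(*(fake_request(page) for page in pages))
    return results, peak


results, peak = asyncio.run(run_limited(range(1, 21), max_concurrent=5))
print(f"{len(results)} pages fetched, never more than {peak} at a time")
```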

An important piece is to allow new URLs/tasks (something like a Queue), but that's for another day. Stay tuned!

Did you find the content helpful? Please, spread the word and share it. 👈

Originally published at https://www.zenrows.com

HTTP Requests in Java with Proxies

Ander Rodriguez — Tue, 29 Mar 2022 07:51:12 +0000

Accessing data over HTTP is more common every day. Be it APIs or webpages, intercommunication between applications is growing. And website scraping.

There is no easy built-in solution to perform HTTP calls in Java. Many packages offer some related functionalities, but it's not easy to pick one. Especially if you need some extra features like connecting via authenticated proxies.

We'll go from the basic request to advanced features using fluent.Request, part of the Apache HttpComponents project.

Direct Request

The first step is to request the desired page. We will use httpbin for the demo. It shows headers and origin IP, allowing us to check if the request was successful.

We need to import Request, get the target page and extract the result as a string. The package provides methods for those cases and many more. Lastly, print the response.

import org.apache.hc.client5.http.fluent.Request;

public class TestRequest {
    public static void main(final String... args) throws Exception {
        String url = "http://httpbin.org/anything";

        String response = Request
                .get(url) // use GET HTTP method
                .execute() // perform the call
                .returnContent() // handle and return response
                .asString(); // convert response to string

        System.out.println(response);
    }
}

We are not handling the response nor checking for errors. It is a simplified version of a real-use case.

But we can see on the result that the request was successful, and our IP shows as the origin. We'll solve that in a moment.

Proxy Request

There are many reasons to add proxies to an HTTP request, such as security or anonymity. In any case, Java libraries (usually) make adding proxies complicated.

In our case, we can use viaProxy with the proxy URL as long as we don't need authentication. More on that later.

For now, we'll use a proxy from a free list. Note that these free proxies might not work for you. They are short-lived.

import org.apache.hc.client5.http.fluent.Request;

public class TestRequest {
    public static void main(final String... args) throws Exception {
        String url = "http://httpbin.org/anything";
        String proxy = "http://169.57.1.85:8123"; // Free proxy

        String response = Request.get(url)
                .viaProxy(proxy) // will set the passed proxy
                .execute().returnContent().asString();

        System.out.println(response);
    }
}

Proxy with Authentication

Paid or private proxy providers - such as ZenRows - frequently require authentication in each call. Sometimes it is done via IP allowlists, but other means such as Proxy-Authorization headers are also common.

Calling the proxy without the proper auth method will result in an error: Exception in thread "main" org.apache.hc.client5.http.HttpResponseException: status code: 407, reason phrase: Proxy Authentication Required.

Following the example, we will need two things: auth and passing the proxy as a Host.

Proxy-Authorization contains the user and password base64 encoded.

Then, we need to change how viaProxy gets the proxy since it does not allow URLs with user and password. For that, we will create a new HttpHost passing in the whole URL. It will internally handle the problem and omit the unneeded parts.

import java.net.URI;
import java.util.Base64;

import org.apache.hc.client5.http.fluent.Request;
import org.apache.hc.core5.http.HttpHost;

public class TestRequest {
    public static void main(final String... args) throws Exception {
        String url = "http://httpbin.org/anything";
        URI proxyURI = new URI("http://YOUR_API_KEY:@proxy.zenrows.com:8001"); // Proxy URL as given by the provider
        String basicAuth = new String(
            Base64.getEncoder() // get the base64 encoder
            .encode(
                proxyURI.getUserInfo().getBytes() // get user and password from the proxy URL
            ));
        String response = Request.get(url)
                .addHeader("Proxy-Authorization", "Basic " + basicAuth) // add auth
                .viaProxy(HttpHost.create(proxyURI)) // will set the passed proxy as a host
                .execute().returnContent().asString();

        System.out.println(response);
    }
}

Ignore SSL Certificates

When adding proxies to SSL (https) connections, libraries tend to raise a warning/error about the certificate. From a security perspective, that is awesome! We avoid being shown or redirected to sites we prefer to avoid.

But what about forcing our connections through our own proxies? There is no security risk in those cases, so we want to ignore those warnings. That is, again, not an easy task in Java.

The error goes something like this: Exception in thread "main" javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.

For this case, we will modify the target URL by switching it to https. And also, call a helper method that we'll create next. Nothing else changes on the main function.

public class TestRequest {
    public static void main(final String... args) throws Exception {
        ignoreCertWarning(); // new method that will ignore certificate warnings

        String url = "https://httpbin.org/anything"; // switch to https
        // ...
    }
}

Now to the complicated and verbose part. We need to create an SSL context and fake certificates. As you can see, the certificates manager and its methods do nothing. It will just bypass the inner working and thus avoid the problems. Lastly, initialize the context with the created fake certs and set it as default. And we are good to go!

import java.security.cert.X509Certificate;
import javax.net.ssl.*;

public class TestRequest {
    // ...
    private static void ignoreCertWarning() {
        SSLContext ctx = null;
        TrustManager[] trustAllCerts = new X509TrustManager[] { new X509TrustManager() {
            public X509Certificate[] getAcceptedIssuers() {return null;}
            public void checkClientTrusted(X509Certificate[] certs, String authType) {}
            public void checkServerTrusted(X509Certificate[] certs, String authType) {}
        } };

        try {
            ctx = SSLContext.getInstance("SSL");
            ctx.init(null, trustAllCerts, null);
            SSLContext.setDefault(ctx);
        } catch (Exception e) {}
    }
}

Conclusion

Accessing data (or scraping) in Java can get complicated and verbose. But with the right tools and libraries, we managed to tame its verbosity, except for the certificate handling.

We might get back to this topic in the future. The HttpComponents library offers attractive functionalities such as async and multi-threaded execution.

Web Scraping with Python 101

Ander Rodriguez — Wed, 19 Jan 2022 12:29:11 +0000

The Internet is a vast data source if you know where to look - and how to extract it! Going page by page and copying the data manually is not an option anymore. And yet, many people still are doing that.

It would be great if our favorite data source were to expose all of it in a convenient format such as CSV. We wish it were so.

What if we told you that there is a solution? We are talking about Website Scraping. It allows you to extract structured information from any source available on the Internet. You can choose the data you will get and how to store it. And it is repeatable, meaning that you can run the same script every day, every hour, or whenever you need to get new items.

Continue with us to learn how to code your own web scraper taking a job board as an example target. We will build step-by-step a web scraping project with Python.

Curious? Let's dive in! 🤿

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install.

pip install requests beautifulsoup4 pandas

Video Tutorial

If you prefer video content, watch this video from our YouTube channel.

What is Web Scraping

Web scraping consists of extracting data from websites. We could do it manually, but scraping generally refers to the automated way: software - usually called bot or crawler - visits web pages and gets the content we are after.

The easier way to access data is via API (Application Programming Interface). The downside is that not many sites offer it or allow limited access. For those sites without API, extracting data directly from the content - scraping - is a viable solution.

As we'll see later, many websites have countermeasures in place to avoid massive scraping. Do not worry for the moment; it should work for a few requests.

Web Scraping Process Summarized

The most famous crawlers are search engines, like Google, which visit and index almost the whole Internet. We'd better start small, so we'll begin from the basics and build upon that. These are the four main parts that form website scraping:

Inspect the Target Site: Get a general idea of what data you can extract and how the page organizes it.
Obtain the HTML: Access the content by downloading the page's HTML. We will focus on static content for simplicity.
Extract data: Obtain the information you are after, usually a piece of data or a list of repeated items (e.g., a job offer or job listings).
Store extracted data: Once extracted, transform and store the data for its use (e.g., save to a CSV file or insert in a database).

Basic Structure of a Web Page

Let's zoom out for a second and go back to the fundamentals - sorry, a bit of theory is coming. You can skip the following two sections if you are already familiar with the basics.

A web page consists mainly of HTML, CSS, Javascript (JS), and images. Bear with us for a second if you don't understand parts of that sentence.

A browser, like Chrome, opens a connection via the Internet to your favorite website using the HTTP protocol. The most common request type is GET, which usually retrieves information without modifying it. Then the server processes it and sends back an HTML (HyperText Markup Language) response. Simplifying, HTML is a text file with a syntax that will tell the browser what content to paint, text to show, and what resources to download.

In those extra resources are the ones mentioned above:

CSS will format and style the content (e.g., colors, fonts, and many more).
Javascript adds functionality and behavior, such as loading more job offers on infinite scroll.
Images... well, you get the point. They can be used as part of the main content or as backgrounds.

The browser will handle all the requests/responses and render the final content. Everything shows the style defined by the CSS file. Thanks to the behavior in the JS files, the infinite scroll works perfectly. Images are loaded and displayed where they should. The browser is really doing a lot of work, but usually so fast that we don't even notice as users.

However, the critical part for web scraping is the initial HTML. Some pages will load content later, but we will focus - for clarity - on those that load everything initially. That difference is usually called static vs. dynamic pages.

If your case involves dynamic pages, you can go to our article on scraping with Selenium, a browser automation tool that can drive a headless browser. In short, it launches a real browser to access the target webpage. But it is programmatically controlled, so you can extract the content you desire.

What is HTML

HTML "is the standard markup language for documents designed to be displayed in a web browser." It will structure the page using tags, each one meaning something different to the browser. For example, <strong> will show text in bold, and <i> will do so in italics.

Other components will control what can be done and not the display format. Examples of that are <form> and <input>, which allow us to fill and send forms to the server, such as logging in and registering.

HTML elements might have attributes such as class or id, which are name-value pairs, separated by =. They are optional but quite common, especially classes. CSS uses them for styling and Javascript for adding interactivity. Some of them are directly associated with a tag, like href with <a> - URL of the link tag.

An example: <a id="example" class="link-style" href="http://example.org/">Go to example.org</a>

The tag <a> tells us that it is a link.
id="example" gives the element a unique ID. Browsers also use this for internal navigation as anchors.
The CSS processor will interpret class="link-style", and it might show in a different color if so defined in the CSS file.
href="http://example.org/" is the link's destination in case the user clicks the element.
Go to example.org is the text that the browser will show. Depending on the browser and CSS file, it might have some default style like a blue color or underline.
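The same breakdown can be reproduced in code. This is a minimal sketch using Python's built-in html.parser module. The LinkInspector class is a hypothetical helper written for this illustration; it records the tag, attributes, and text of the example link.

```python
from html.parser import HTMLParser


class LinkInspector(HTMLParser):
    """Hypothetical helper: records the tag, attributes, and text it sees."""

    def __init__(self):
        super().__init__()
        self.tag = None
        self.attributes = {}
        self.text = ""

    def handle_starttag(self, tag, attrs):
        self.tag = tag
        self.attributes = dict(attrs)  # attrs is a list of (name, value) pairs

    def handle_data(self, data):
        self.text += data


inspector = LinkInspector()
inspector.feed(
    '<a id="example" class="link-style" href="http://example.org/">'
    "Go to example.org</a>"
)
inspector.close()

print(inspector.tag)
print(inspector.attributes)
print(inspector.text)
```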

How to Obtain HTML using Requests

Now that we've seen the basics, let's use Python and the Requests library to download a page.

We will start by importing the library and defining a variable with the URL we want to access. Then, use the get function to obtain the page and print the response. It will be an object with the response's data: the HTML and other essential pieces such as status code.

import requests

url = "http://example.org/"
response = requests.get(url)
print(response) # <Response [200]>
print(response.status_code) # 200
print(response.text) # <!doctype html>...

We won't go into further detail on status codes, which indicate whether the request was successful. The ones in the 200 - 299 range mean success (sometimes denoted 2XX), 3XX indicates a redirection, 4XX client error, and 5XX server error. Don't worry, the responses should be 200 in our tests, and you will learn later what to do otherwise.
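As a quick illustration of those ranges, the sketch below maps a status code to its family. The describe_status helper is hypothetical, not part of Requests, and only covers the groups mentioned above.

```python
def describe_status(status_code):
    """Map an HTTP status code to the family it belongs to."""
    if 200 <= status_code <= 299:
        return "success"
    if 300 <= status_code <= 399:
        return "redirection"
    if 400 <= status_code <= 499:
        return "client error"
    if 500 <= status_code <= 599:
        return "server error"
    return "other"


for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```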

Parsing Data with BeautifulSoup

As seen in the response's text above, the data is there, but it is not easy to obtain the interesting bits. It is prepared to be consumed by a browser, not a human! To make it accessible, we will use BeautifulSoup, "a Python library for pulling data out of HTML."

It will allow us to get the data we want using the classes and IDs mentioned above. Following the example in the previous section, we will access the title (<h1>) and the link (<a>). We'll see how we know what tags to look for in a moment.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

title = soup.find("h1")
print(title.text) # Example Domain

link = soup.find("a")
print(link.text) # More information...

What happens if we want to access several items? There are two paragraphs (<p>) in the example, but the find function would only get the first one. find_all will do precisely that, returning a list (ResultSet) instead of an element.

paragraphs = soup.find_all("p")
print(len(paragraphs)) # 2

print([p.text for p in paragraphs])
# ['This domain ...', 'More information...']

But we can access more than just text. For example, the link's target URL is in the element's href property, accessible with the get function.

print(link.get("href")) # https://www.iana.org/domains/example

So far, so good. But this is a simple example, and we accessed all the items by the tags. What about a more complex page with tens of different tags? We will probably need classes, IDs, and a new concept: nesting.

We will build a functional web scraper with an example site in the following sections. For now, a quick guide on the fundamental selectors:

soup.find("div") will get the first element that matches the div tag.
soup.find(id="header") looks for a node whose ID is header. IDs are unique, so there shouldn't be more items with the same one.
soup.find(class_="my-class") returns the first item that contains the my-class class. Nodes can have multiple classes separated with spaces.
soup.find("div").find(id="header").find(class_="my-class") As you can see, it might get complicated fast. The example gets the first div, finds the header by ID, and an item with my-class inside that. It is called nesting and is common in HTML, where almost all items are nested inside a parent tag.

Now that we've got the basics, we can move on to the fun part.

Step 1: Explore Target with DevTools

Hold your horses 🐎! We know that coding is the fun part, but first, you'll need some understanding of the page you're trying to scrape. And not just the content but also the structure. We suggest browsing the target site for a few minutes with DevTools open (or any other tool).

For the rest of the examples, we will use remotive.io to demonstrate what can be done. Its homepage contains lists of job offers. We'll go step-by-step, getting the data available and structuring it.

To start exploring, go to the page on a new tab and open DevTools. You can do it by pressing Control+Shift+I (Windows/Linux), Command+Option+I (Mac), or Right-click ➡ Inspect. Now go to the Elements tab [#1 on the image below], which will show the page's HTML. To inspect an element on the page, click on the select icon (to the left) [#2], and it will allow you to pick an item using the mouse [#3]. It will be highlighted and expanded on the Elements tab [#4].

Once familiarized with DevTools, take a look, click elements, inspect the HTML, and understand how classes are used. Discover as much as possible and identify the essential parts (hint: look for the job entries with the class job-tile). We picked a well-structured site with classes that reflect the content, but it is not always the case.

Then explore other page types too, such as category or job offer. Maybe some details are essential to you but only available on the offer page. That would change the way to approach the coding. Or you realize that, as is the case, the structure for the category pages (e.g., Software Development) is the same as the homepage. That means we could scrape all the category pages for a bigger offer list! We won't do that for now, but we could do it.

Spoiler: it is not precisely the same since category pages are dynamically rendered. As mentioned above, headless browsers are needed for those cases.

Step 2: Obtain HTML with Requests

The first part of the code is the same as with example.org. Get the content using requests.get.

import requests

url = "https://remotive.io/"
response = requests.get(url)
print(response.status_code) # 200
print(response.text) # <!DOCTYPE html>

As mentioned earlier, many sites have anti-scraping software. The most basic action is blocking an IP with too many requests. Adding proxies to requests is simple, and they will hide your IP. This way, in case of being banned, your home/office address would be unaffected. But even better, some proxies add rotating power, which means that they assign a new IP to every request. That makes the banning part much more complicated, giving you room to scrape with fewer restrictions.

Tagging IPs is just the most common measure, but they might implement many more such as checking headers or geolocation. But for most sites, this approach should work fine.

import requests

url = "https://remotive.io/"
proxies={'https': 'http://api_key:@proxy.zenrows.com:8001'}
response = requests.get(url, proxies=proxies, verify=False)
print(response.text) # <!DOCTYPE html>

We are fetching just one URL, which is fine to begin. But it gets more complicated when trying to scrape at scale, so we will keep it sequential and straightforward at this point.

This step might look like the easiest one at first since requests handles it for us. But we mention the complications for you to be aware of 😅

Step 3: Extract Data with BeautifulSoup

This step is the one that changes most from case to case. Since it decides what data you will extract, it varies for each page type you want. For example, the selectors for the offer page won't work for the homepage. And suppose anything changes on the page (new fields, design change). In that case, you will probably need to modify the scraper to reflect those changes.

Following the job board example, the first thing we want to access is the job offer wrapper using its ID (id="initial_job_list"), then a job offer using its class (class_="job-tile"), and finally its title, also by class. Note that the argument is "class_" and not "class", since class is a reserved word in Python.

import requests
from bs4 import BeautifulSoup

url = "https://remotive.io/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

jobs_wrapper = soup.find(id="initial_job_list")
job = jobs_wrapper.find(class_="job-tile")
job_title = job.find(class_="job-tile-title")
print(job_title.text) # Senior Front-End Engineer

See how we use the variable created in the previous line instead of soup? That means that the lookup will take place in the element - i.e., jobs_wrapper - and not the whole page. This is an example of the previously mentioned nesting but storing them in variables instead of concatenation.

There is a slight problem there, right? We only got ONE job offer! We need to change jobs_wrapper.find for jobs_wrapper.find_all to get all the job offers and then extract the data we want for each one. We will move that logic to a function to separate concerns.

#...
def extract_data(job):
    job_title = job.find(class_="job-tile-title")
    extra = job.find("p").text.split("·")
    return {
        "id": job.parent.get("id"),
        "title": job_title.text.strip(),
        "link": job_title.get("href"),
        "company": extra[0].strip(),
        "location": extra[1].strip(),
        "category": job.find(class_="remotive-tag-transparent").text,
    }

jobs_wrapper = soup.find(id="initial_job_list")
jobs = jobs_wrapper.find_all(class_="job-tile")
results = [extract_data(job) for job in jobs]

print(results)

# [
#   {
#     "id": "971825-software-dev",
#     "title": "Senior Front-End Engineer (JS/Graphics/3D)",
#     "link": "/remote-jobs/software-dev/senior-front-end-engineer-js-graphics-3d-971825",
#     "company": "Spline",
#     "location": "Remote (Anywhere)",
#     "category": "Software Development"
#   },
#   ...
# ]

BeautifulSoup also offers the select function, which takes a CSS selector and returns a list of matching elements.

jobs = jobs_wrapper.select(".job-tile")

For more ideas, check out our article on tricks when scraping. We explain that you can get relevant data from different sources such as hidden inputs, tables, or metadata. The selectors used here are very useful, but there are many alternatives.

And the same idea applies to extracting data, although we already moved it to the extract_data function. This way, if we were to change any selector because the page changed, we would locate it quickly.

Step 4: Convert and Store Data with Pandas

We can read and operate with the extracted data. But for future analysis and usage, it is better to store it. In a real-world project, a database is usually the chosen option. We will save it in a CSV file for the demo.

Since CSV files and formats are beyond this tutorial, we will use the pandas library. The first thing is to create a DataFrame from the data we got. It accepts dictionaries usually without problems. Then, we call the helpful to_csv function to convert the dataframe to CSV and save it in a file.

import pandas as pd
# ...
data = pd.DataFrame(results)
data.to_csv("offers.csv", index=False)

As with any piece of software, we should follow good development practices. Single responsibility is one of them, and we should separate each of the steps above for maintainability. Storing CSV files is probably not a good option long-term. And, when the moment comes, it will be easier to replace if that part is separated from the extraction logic.
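One way to keep storage separate is a small function that only knows how to write rows. The sketch below uses the standard csv module instead of pandas and writes to any file-like object. The store_results name and the second sample row are assumptions for illustration.

```python
import csv
import io


def store_results(results, output):
    """Write a list of dictionaries as CSV rows to a file-like object."""
    if not results:
        return
    # use the keys of the first item as the CSV header
    writer = csv.DictWriter(output, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)


sample_results = [
    {"id": "971825-software-dev", "title": "Senior Front-End Engineer (JS/Graphics/3D)", "company": "Spline"},
    {"id": "hypothetical-2", "title": "Engineering Manager", "company": "Example Co"},
]

buffer = io.StringIO()  # in-memory file, handy for trying it out
store_results(sample_results, buffer)
print(buffer.getvalue())
```

To save to disk, pass a file opened with open("offers.csv", "w", newline="") instead of the buffer. Swapping in a database later would only mean replacing this one function.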

Next Steps

Celebrate now! You built your first scraper. 🎉

Error Handling

We said earlier that you should not worry about the error codes for the tutorial. But for a real-world project, there are a few things that you should consider adding:

Check response.ok. The library comes with a boolean that tells us if it went correctly.
Error handling for the critical parts (try/except).
Retry requests in case of error.

import requests
from bs4 import BeautifulSoup

url = "https://remotive.io/"
try:
    response = requests.get(url)
    if response.ok:
        # ... parse and extract the data as in the previous steps
        pass
    else:
        print(response) # status_code is 4XX or 5XX
except Exception as e:
    print(e)
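The retry point from the list above is not covered by that snippet. Here is a minimal sketch of one possible approach, retrying with a growing delay. The with_retries helper and flaky_fetch are hypothetical; in a real scraper, fetch would wrap the requests.get call.

```python
import time


def with_retries(fetch, attempts=3, base_delay=0.01):
    """Call fetch() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                # wait a bit longer after each failure: 1x, 2x, 4x...
                time.sleep(base_delay * 2 ** attempt)
    raise last_error


calls = {"count": 0}


def flaky_fetch():
    """Hypothetical fetch that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


html = with_retries(flaky_fetch)
print(html, "after", calls["count"], "calls")
```

The tiny base_delay keeps the demo fast. For real requests, a delay of a second or more would be kinder to the target server.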

Scaling Up

The following steps would be to scale the scrapers and automate running them. We won't go deep into these topics, just a quick overview.

Following the example, two main things come to mind: scraping category pages instead of the homepage and getting more data from the job offer detail page. Depending on the intention behind the scraping, one might make more sense than the other.

We'll take the second one, scraping further details from each offer. We have already extracted the links in the previous steps to get them. We have to loop over them, get the link, and request the URL. Prepend the domain since it is a relative path. We simplified the data extraction part for brevity. The point is the same as above, extract the data you need using selectors.

Be careful when running the code below. We added a "trick" to slice the array and get only two job offer details. The idea is to avoid a hundred requests to the same host in seconds and make it faster.

results = results[:2] # trick to avoid a hundred requests

job_details = []
for job in results:
    response = requests.get("https://remotive.io" + job["link"])
    soup = BeautifulSoup(response.text, "html.parser")

    job_title = soup.find("h1")
    job_desc = soup.find(class_="job-description")
    job_meta = soup.find(id="job-meta-data")
    job_details.append({
        "title": job_title.text,
        # ...
    })

print(job_details)

# [
#     {
#         'title': 'Senior Front-End Engineer (JS/Graphics/3D)'
#     }, {
#         'title': 'Engineering Manager'
#     }, ...
# ]
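The loop above prepends the domain with plain string concatenation, which works as long as every link is a relative path starting with a slash. A slightly sturdier option, shown here as a sketch, is urljoin from the standard library, which also leaves absolute URLs untouched. The second link is a made-up example.

```python
from urllib.parse import urljoin

base_url = "https://remotive.io"

relative_link = "/remote-jobs/software-dev/senior-front-end-engineer-js-graphics-3d-971825"
absolute_link = "https://example.org/some-offer"  # hypothetical absolute link

full_relative = urljoin(base_url, relative_link)
full_absolute = urljoin(base_url, absolute_link)

print(full_relative)
print(full_absolute)
```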

Other Countermeasures

Many other problems appear when scraping, one of the most common being blocks. For privacy or security, some sites will not allow access to users that request too many pages. Or will show captchas to ensure that they are real users.

We can take many actions to avoid blocks. Still, as mentioned earlier, the most effective one is using smart proxies that will change your IP in every request. If you are interested or have any problems, check out our article on avoiding detection.

Conclusion

We'd like you to leave with the four main steps:

Explore the target site before coding
Retrieve the content (HTML)
Extract the data you need with selectors
Transform and store it for its use

If you leave with those points clear, we'll be more than happy 🥳

Web scraping with Python, or any other language/tool, is a long road. Try not to feel overwhelmed by the immensity of resources available. Scraping Instagram on your first day will probably not be possible since they are well-known blockers. Start with an easier target and gain some confidence.

Feel free to ask any follow-up questions or contact us.

Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈

Originally published at https://www.zenrows.com

DOs and DON'Ts of Web Scraping

Ander Rodriguez — Tue, 21 Dec 2021 11:09:49 +0000

For those of you new to web scraping, regular users, or just curious: these tips are golden. Scraping might seem an easy-entry activity, and it is. But it will take you down a rabbit hole. Before you realize it, you got blocked from a website, your code is 110% spaghetti, and there's no way you can scale that to another four sites.

Ever been there? ✋ I was there 10 years ago — no shame (well, just a bit). Continue with us for a few minutes, and we'll help you navigate through the rabbit hole. 🕳️

DO Rotate IPs

The simplest and most common anti-scraping technique is to ban by IP. The server will show you the first pages, but it will detect too much traffic from the same IP and block it after some time. Then your scraper will be unusable. And you won't even be able to access the webpage from a real browser. The first lesson on web scraping is never to use your actual IP.

Every request leaves a trace, even if you try to avoid it from your code. There are some parts of the networking that you cannot control. But you can use a proxy to change your IP. The server will see an IP, but it won't be yours. The next step, rotate the IP or use a service that will do it for you. What does this even mean?

You can use a different IP every few seconds or per request. The target server will then have a hard time linking your requests together and is far less likely to block those IPs. You can build a massive list of proxies and take one randomly for every request. Or use a Rotating Proxy, which will do that for you. Either way, the chances of your scraper working correctly skyrocket with just this change.

import requests
import random

urls = ["http://ident.me"] # ... more URLs
proxy_list = [
    "54.37.160.88:1080",
    "18.222.22.12:3128",
    # ... more proxy IPs
]

for url in urls:
    proxy = random.choice(proxy_list)
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    response = requests.get(url, proxies=proxies)
    print(response.text)
    # prints 54.37.160.88 (or any other proxy IP)

Note that these free proxies might not work for you. They are short-lived.

DO Use Custom User-Agent

The second-most-common anti-scraping mechanism is User-Agent. UA is a header that browsers send in requests to identify themselves. It is usually a long string declaring the browser's name, version, platform, and more. An example for an iPhone running iOS 13:

"Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1"

There is nothing wrong with sending a User-Agent, and it is actually recommended to do so. The problem is which one to send. Many HTTP clients send their own (cURL, requests in Python, or Axios in Javascript), which might be suspicious. Can you imagine your server getting hundreds of requests with a "curl/7.74.0" UA? You'd be skeptical at the very least.

The solution is usually finding valid UAs, like the one from the iPhone above, and using them. But it might turn against you also. Thousands of requests with exactly the same version in short periods?

So the next step is to have several valid and modern User-Agents and use those. And to keep the list updated. As with the IPs, rotate the UA in every request in your code.

# ... same as above
user_agents = [
    "Mozilla/5.0 (iPhone ...",
    "Mozilla/5.0 (Windows ...",
    # ... more User-Agents
]

for url in urls:
    proxy = random.choice(proxy_list)
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    # pick a different User-Agent for every request
    headers = {"User-Agent": random.choice(user_agents)}
    response = requests.get(url, proxies=proxies, headers=headers)
    print(response.text)

DO Research Target Content

Take a look at the source code before starting development. Many websites offer more manageable ways to scrape data than CSS selectors. A standard method of exposing data is through rich snippets, for example, via Schema.org JSON or itemprop data attributes. Others use hidden inputs for internal purposes (i.e., IDs, categories, product code), and you can take advantage. There's more than meets the eye.
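
As a rough sketch of the rich-snippet idea, the code below pulls Schema.org JSON out of a page using only Python's standard library. The HTML, the product name, and the price are made up for illustration; a real page may contain several snippets or malformed ones.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed snippets


# Made-up HTML: note the meaningless CSS class next to the structured data
html_doc = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example Book", "offers": {"@type": "Offer", "price": "19.99"}}
</script>
</head><body><div class="x1y2z3">Example Book</div></body></html>
"""

parser = JsonLdParser()
parser.feed(html_doc)
product = parser.items[0]
print(product["name"], product["offers"]["price"])
```

The extractor never touches the CSS class, so a redesign that renames it would not break this code.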

Some other sites rely on XHR requests after the first load to get the data. And it comes structured! For us, the easier way is to browse the site with DevTools open and check both the HTML and Network tab. You will have a clear vision and decide how to extract the data in a few minutes. These tricks are not always available, but you can save a headache by using them. Metadata, for example, tends to change less than HTML or CSS classes, making it more reliable and maintainable long-term.

We wrote about exploring before coding with examples and code in Python; check it out for more info.

DO Parallelize Requests

After switching gear and scaling up, the old one-file sequential script will not be enough. You probably need to "professionalize" it. For a tiny target and a few URLs getting them one by one might be enough. But then scale it to thousands and different domains. It won't work correctly.

One of the first steps of that scaling would be to get several URLs simultaneously and not stop the whole scraping for a slow response. Going from a 50-line script to Google scale is a giant leap, but the first steps are achievable. There are two main things you'll need: concurrency and a queue.

Concurrency

The main idea is to send multiple requests simultaneously but with a limit. And then, as soon as a response arrives, send a new one. Let's say the limit is ten. That would mean that ten URLs would always be running at any given time until there are no more, which brings us to the next step.

We wrote a guide on using concurrency (examples in Python and Javascript).
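
A minimal sketch of the "ten at a time" idea, using asyncio from the standard library. The fetch function is a stand-in that only sleeps, and the example.com URLs are placeholders; in a real scraper you would replace it with an actual HTTP call.

```python
import asyncio

MAX_CONCURRENCY = 10  # the limit used as an example in the text


async def fetch(url):
    # Stand-in for a real HTTP request; it only sleeps for a moment
    await asyncio.sleep(0.01)
    return f"<html>content of {url}</html>"


async def fetch_all(urls, limit=MAX_CONCURRENCY):
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def bounded_fetch(url):
        nonlocal running, peak
        async with semaphore:  # waits here while `limit` requests are in flight
            running += 1
            peak = max(peak, running)
            try:
                return await fetch(url)
            finally:
                running -= 1

    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return results, peak


urls = [f"https://example.com/page/{i}" for i in range(50)]
results, peak = asyncio.run(fetch_all(urls))
print(len(results), "pages fetched, never more than", peak, "at the same time")
```

As soon as one request finishes, the semaphore lets the next waiting URL start, which is the behavior described above.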

Queue

A queue is a data structure that allows adding items to be processed later. You can start the crawling with a single URL, get the HTML and extract the links you want. Add those to the queue, and they will start running. Keep on doing the same, and you built a scalable crawler. Some points are missing, like deduplicating URLs (not crawling the same one twice) or infinite loops. But the easy way to solve it would be to set a maximum number of pages crawled and stop once you get there.

We have an article with an example in Python scraping from a seed URL.
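
Here is a small sketch of that queue, including deduplication and the page limit. The link graph is hypothetical and replaces real downloads so the example runs on its own; extract_links is where the HTML fetching and parsing would go.

```python
from collections import deque

# Hypothetical site: URL -> links found on that page
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/d"],  # links back to the start: a potential infinite loop
    "/c": [],
    "/d": ["/e"],
    "/e": [],
}


def extract_links(url):
    # A real crawler would download the HTML and parse the links here
    return FAKE_SITE.get(url, [])


def crawl(seed, max_pages=4):
    queue = deque([seed])
    seen = {seed}  # deduplication: never queue the same URL twice
    crawled = []

    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled


print(crawl("/"))  # ['/', '/a', '/b', '/c']
```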

Still far from Google scale (obviously), but you can go to thousands of pages with this approach. To be more precise, you can have different settings per domain to avoid overloading a single target. We'll leave that up to you 😉

DON'T Use Headless Browsers for Everything

Selenium, Puppeteer, and Playwright are great, no doubt, but not a silver bullet. They bring a resource overhead and slow down the scraping process. So why use them? 100% needed for Javascript rendered content and helpful in many circumstances. But ask yourself if that's your case.

Most sites serve the data, one way or another, on the first HTML request. Because of that, we advocate going the other way around. First test plain HTML by using your favorite tool and language (cURL, requests in Python, Axios in Javascript, whatever). Check for the content you need: text, IDs, prices. Be careful here since sometimes the data you see on the browser might be encoded (e.g., " shown in plain HTML as &quot;). Copy & paste might not work. 😅
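
To illustrate the encoding pitfall, this short sketch decodes HTML entities with the standard html.unescape before searching for the text. The HTML fragment and the title are made up.

```python
import html

raw = '<span class="title">&quot;Dune&quot; &amp; other stories</span>'
needle = '"Dune" & other stories'  # what you would copy from the browser

print(needle in raw)                 # False
print(needle in html.unescape(raw))  # True
```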

In some cases, you won't find the info because it is not there on the first load, for example, in Angular.io. No problem, headless browsers come in handy for those cases. Or XHR scraping as shown above for Auction.

If you find the info, try to write the extractors. A quick hack might be good enough for a test. Once you have identified all the content you want, the following point is to separate generic crawling code from the custom one for the target site.

Using Python's "requests": 2.41 seconds
Playwright with chromium opening a new browser per request: 11.33 seconds
Playwright with chromium sharing browser and context for all the URLs: 7.13 seconds

It is not 100% conclusive nor statistically accurate, but it shows the difference. In the best case, we are talking about 3x slower using Playwright, and sharing context is not always a good idea. And we are not even talking about CPU and memory consumption.

DON'T Couple Code to Target

Some actions are independent of the website you are scraping: get HTML, parse it, queue new links to crawl, store content, and more. In an ideal scenario, we would separate those from the ones that depend on the target site: CSS selectors, URL structure, database structure.

The first script is usually entangled; no problem there. But as it grows and new pages are added, separating responsibilities is crucial. We know, easier said than done. But pausing to think matters when developing a maintainable and scalable scraper.

We published a repository and blog post about distributed crawling in Python. It is a bit more complicated than what we've seen so far. It uses external software (Celery for asynchronous task queue and Redis as the database).

How to get the HTML (requests VS headless browser)
Filter URLs to queue for crawling
What content to extract (CSS selectors)
Where to store the data (a list in Redis)

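# ...
def extract_content(url, soup):
    ...  # what content to extract (CSS selectors)

def store_content(url, content):
    ...  # where to store the data

def allow_url_filter(url):
    ...  # which URLs to queue for crawling

def get_html(url):
    # headless_chromium, random_headers and random_proxies are helpers not shown here
    return headless_chromium.get_html(url, headers=random_headers(), proxies=random_proxies())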

It is still far from massive scale production-ready. But code reuse is easy, as is adding new domains. And when adding updated browsers or headers, it would be easy to modify the old scrapers to use those.

DON'T Take Down your Target Site

Your extra load might be a drop in the ocean for Amazon but a burden for a small independent store. Be mindful of the scale of your scraping and the size of your targets.

You can probably crawl hundreds of pages at Amazon concurrently, and they won't even notice (careful nonetheless). But many websites run on a single shared machine with poor specs, and they deserve our understanding. Tune down your script's capabilities for those sites. It might complicate the code, but stopping if the response times increase would be nice.
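
One possible way to react to slower responses is sketched below: double the pause between requests when a response takes much longer than the first one did, and recover slowly afterward. The class name, the thresholds, and the response times are all assumptions for illustration.

```python
class AdaptiveThrottle:
    """Grows the delay between requests when the target slows down."""

    def __init__(self, base_delay=1.0, max_delay=60.0, slow_factor=2.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.slow_factor = slow_factor
        self.delay = base_delay
        self.baseline = None  # first response time observed

    def record(self, response_time):
        if self.baseline is None:
            self.baseline = response_time
        elif response_time > self.baseline * self.slow_factor:
            # the server looks overloaded: back off
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # back to normal: recover step by step
            self.delay = max(self.delay / 2, self.base_delay)
        return self.delay


throttle = AdaptiveThrottle(base_delay=1.0)
delays = [throttle.record(seconds) for seconds in [0.2, 0.25, 0.9, 1.4, 0.3]]
print(delays)  # [1.0, 1.0, 2.0, 4.0, 2.0]
```

In a scraper, you would measure each request and call time.sleep with the returned delay before sending the next one.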

Another point is to inspect and comply with their robots.txt. Mainly two rules: do not scrape disallowed pages and obey Crawl-Delay. That directive is not common, but when present, it represents the number of seconds crawlers should wait between requests. There is a Python module that can help us comply with robots.txt.
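
One option, assumed here, is urllib.robotparser from the standard library. The sketch parses a made-up robots.txt from a string so it runs offline; against a real site you would call set_url with the robots.txt address and then read(). The "my-scraper" agent name and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("my-scraper", "https://example.com/products"))   # True
print(parser.can_fetch("my-scraper", "https://example.com/private/x"))  # False
print(parser.crawl_delay("my-scraper"))                                 # 5
```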

We will not go into details but do not perform malicious activities (there should be no need to say it, just in case). We are always talking about extracting data without breaking the law or causing damage to the target site.

DON'T Mix Headers from Different Browsers

This last technique is for higher-level anti-bot solutions. Browsers send several headers with a set format that varies from version to version. And advanced solutions check those and compare them to a real-world header set database. Which means you will raise red flags when sending the wrong ones. Or even more difficult to notice, by not sending the right ones! Visit httpbin to see the headers your browser sends. Probably more than you imagine and some you haven't even heard of! "Sec-Ch-Ua"? 😕

There is no easy way out of this but to have an actual full set of headers. And to have plenty of them, one for each User-Agent you use. Not one for Chrome and another for iPhone, nope. One. Per. User-Agent. 🤯

Some people try to avoid this by using headless browsers, but we already saw why it is better to avoid them. And anyway, you are not in the clear with them. They send the whole set of headers that work for that browser on that version. If you modify any of that, the rest might not be valid. If you use Chrome with Puppeteer and overwrite the UA with the iPhone one... you might be in for a surprise. A real iPhone does not send "Sec-Ch-Ua", but Puppeteer will since you overwrote the UA but didn't delete that one.

Some sites offer a list of User-Agents. But it is hard to get the complete sets for hundreds of them, which is the needed scale when scraping at complex sites.

# ... 

header_sets = [ 
    { 
        "Accept-Encoding": "gzip, deflate, br", 
        "Cache-Control": "no-cache", 
        "User-Agent": "Mozilla/5.0 (iPhone ...", 
        # ... 
    }, { 
        "User-Agent": "Mozilla/5.0 (Windows ...", 
        # ... 
    }, 
    # ... more header sets 
] 

for url in urls: 
    # ... 
    headers = random.choice(header_sets) 
    response = requests.get(url, proxies=proxies, headers=headers) 
    print(response.text)

We know this last one was a bit picky. But some anti-scraping solutions can be super-picky and even more than headers. Some might check browser or even connection fingerprinting — high-level stuff.

Conclusion

Rotating IPs and having good headers will allow you to crawl and scrape most websites. Use headless browsers only when necessary and apply Software Engineering good practices.

Build small and grow from there, add functionalities and use cases. But always try to keep scale and maintainability in mind while keeping success rates high. Don't despair if you get blocked from time to time, and learn from every case.

Web scraping at scale is a challenging and long journey, but you might not need the best ever system. Nor a 100% accuracy. If it works on the domains you want, good enough! Do not freeze trying to reach perfection since you probably don't need it.

In case of doubts, questions, or suggestions, do not hesitate to contact us.

Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈

Originally published at https://www.zenrows.com

Intro to Web Scraping with Selenium in Python

Ander Rodriguez — Tue, 30 Nov 2021 09:42:59 +0000

Ever heard of headless browsers? Mainly used for testing purposes, they give us an excellent opportunity for scraping websites that require Javascript execution or any other feature that browsers offer.

You'll learn how to use Selenium and its multiple features to scrape and browse any web page. From finding elements to waiting for dynamic content to load. Modify the window size and take screenshots. Or add proxies and custom headers to avoid blocks. You can achieve all of that and more with this headless browser.

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install Selenium, Chrome, and the driver for Chrome. Make sure to match the browser and driver versions, Chrome 96, as of this writing.

pip install selenium

Other browsers are available (Edge, IE, Firefox, Opera, Safari), and the code should work with minor adjustments.

Getting Started

Once set up, we will write our first test. Go to a sample URL and print its current URL and title. The browser will follow redirects automatically and load all the resources - images, stylesheets, javascript, and more.

from selenium import webdriver

url = "http://zenrows.com"
with webdriver.Chrome() as driver:
    driver.get(url)

    print(driver.current_url) # https://www.zenrows.com/
    print(driver.title) # Web Scraping API & Data Extraction - ZenRows

If your Chrome driver is not in an executable path, you need to specify it or move the driver to somewhere in the path (i.e., /usr/bin/).

from selenium.webdriver.chrome.service import Service

chrome_driver_path = '/path/to/chromedriver'
with webdriver.Chrome(service=Service(executable_path=chrome_driver_path)) as driver:
    # ...

You noticed that the browser is showing, and you can see it, right? It won't run headless by default. We can pass options to the driver, which is what we want to do for scraping.

options = webdriver.ChromeOptions()
options.add_argument("--headless")
with webdriver.Chrome(options=options) as driver:
    # ...

Finding Elements and Content

Once the page is loaded, we can start looking for the information we are after. Selenium offers several ways to access elements: ID, tag name, class, XPath, and CSS selectors.

Let's say that we want to search for something on Amazon by using the text input. We could select by tag from the previous options: driver.find_element(By.TAG_NAME, "input"). But this might be a problem since there are several inputs on the page. By inspecting the page, we see that it has an ID, so we change the selector: driver.find_element(By.ID, "twotabsearchtextbox").

IDs probably don't change often, and they are a more reliable way of extracting info than classes. The problem usually comes from not finding them. Assuming there is no ID, we can select the search form and then the input inside.

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.amazon.com/"

with webdriver.Chrome(options=options) as driver:
    driver.get(url)

    input = driver.find_element(By.CSS_SELECTOR,
        "form[role='search'] input[type='text']")

There is no silver bullet; each option is appropriate for a set of cases. You'll need to find the one that best suits your needs.

If we scroll down the page, we'll see many products and categories. And a shared class that often repeats: a-list-item. We need a similar function (find_elements in plural) to match all the items and not just the first occurrence.

#...
    driver.get(url)
    items = driver.find_elements(By.CLASS_NAME, "a-list-item")

Now we need to do something with the selected elements.

Interacting with the Elements

We'll search using the input selected above. For that, we need the send_keys function that will type and hit enter to send the form. We could also type into the input and then find the submit button and click on it (element.click()). It is easier in this case since the Enter works fine.

from selenium.webdriver.common.keys import Keys
#...
    input = driver.find_element(By.CSS_SELECTOR,
        "form[role='search'] input[type='text']")
    input.send_keys('Python Books' + Keys.ENTER)

Notice that the script does not wait and closes as soon as the search finishes. The logical thing is to do something afterward, so we'll list the results using find_elements as above. Inspecting the result, we can use the s-result-item class.

The items we just selected are divs with several inner tags. We could take the link's href values if interested and visit each item - we won't do that for the moment. But the h2 tags contain the book's title, so we need to select the title for each element. We can continue using find_element since it will work for driver, as seen before, and for any web element.

# ...
    items = driver.find_elements(By.CLASS_NAME, "s-result-item")

    for item in items:
        h2 = item.find_element(By.TAG_NAME, "h2")
        print(h2.text) # Prints a list of around fifty items

# Learning Python, 5th Edition ...

Don't rely too much on this approach since some tags might be empty or have no title. We should adequately implement error control for an actual use case.

Infinite Scroll

For those cases when there is an infinite scroll (Pinterest), or images are lazily loaded (Twitter), we can also scroll down using the keyboard. It is not often used, but scrolling with the space bar, "Page Down", or "End" keys is an option. And we can take advantage of that.

The driver won't accept it directly. We first need to find an element like body and send the keys there.

driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)

But there is still another problem: items will not be present just after scrolling. That brings us to the next part.

Wait for Content or Element

Nowadays, many websites are Javascript intense - especially when using modern frameworks like React - and do lots of XHR calls after the first load. As with the infinite scroll, all that content won't be available to Selenium. But we can manually inspect the target website and check what the result of that processing is.

It usually comes down to creating some DOM elements. If those classes are unique or they have IDs, we can wait for those. We can use the WebDriverWait to put the script on hold until some criteria are met.

Assume a simple case where there are no images present until some XHR finishes. This instruction will return the img element as soon as it appears. The driver will wait for 3 seconds and fail otherwise.

from selenium.webdriver.support.ui import WebDriverWait
# ...
    el = WebDriverWait(driver, timeout=3).until(
        lambda d: d.find_element(By.TAG_NAME, "img"))

Selenium provides several expected conditions that might prove valuable. element_to_be_clickable is an excellent example in a page full of Javascript, since many buttons are not interactive until some actions occur.

from selenium.webdriver.support import expected_conditions as EC
#...
    button = WebDriverWait(driver, 3).until(
        EC.element_to_be_clickable((By.CLASS_NAME, 'my-button')))
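
Both waits above boil down to polling: call a condition again and again until it returns something or the timeout expires. The sketch below shows that idea in plain Python. It is not Selenium's code; wait_until, find_img, and the LookupError stand-in for "element not found" are all made up for illustration.

```python
import time


def wait_until(condition, timeout=3.0, poll=0.1):
    """Call `condition` until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = condition()
            if value:
                return value
        except LookupError:
            pass  # stand-in for "element not found yet"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met after {timeout}s")
        time.sleep(poll)


# Simulated page: the "img" element only shows up after 0.3 seconds
appears_at = time.monotonic() + 0.3


def find_img():
    if time.monotonic() < appears_at:
        raise LookupError("no img yet")
    return "<img src='photo.jpg'>"


print(wait_until(find_img, timeout=3))
```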

Screenshots and Element Screenshots

Be it for testing purposes or storing changes, screenshots are a practical tool. We can take a screenshot for the current browser context or a given element.

# ...
    driver.save_screenshot('page.png')

# ... 
    card = driver.find_element(By.CLASS_NAME, "a-cardui") 
    card.screenshot("amazon_card.png")

Noticed the problem with the first image? Nothing wrong, but the size is probably not what you were expecting. Selenium loads by default in 800px by 600px when browsing in headless mode. But we can modify it to take bigger screenshots.

Window Size

We can query the driver to check the size we're launching in: driver.get_window_size(), which will print {'width': 800, 'height': 600}. When using GUI, those numbers will change, so let's assume that we're testing headless mode.

There is a similar function - set_window_size - that will modify the window size. Or we can add an options argument to the Chrome web driver that will directly start the browser with that resolution.

options.add_argument("--window-size=1024,768")

with webdriver.Chrome(options=options) as driver:
    print(driver.get_window_size())
    # {'width': 1024, 'height': 768}

    driver.set_window_size(1920,1200)

    driver.get(url)

    print(driver.get_window_size())
    # {'width': 1920, 'height': 1200}

And now our screenshot will be 1920px wide.

Custom Headers

The options mentioned above provide us with a crucial mechanism for web scraping: custom headers.

User Agent

One of the essential headers to avoid blocks is User-Agent. Selenium will provide an accurate one by default, but you can change it for a custom one. Remember that there are many techniques to crawl and scrape without blocks and we won't cover them all here.

user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
options.add_argument('user-agent=%s' % user_agent)
with webdriver.Chrome(options=options) as driver:
    driver.get(url)
    print(driver.find_element(By.TAG_NAME, "body").text) # UA matches the one hardcoded above, v93

Other Important Headers

As a quick summary, changing the user-agent might be counterproductive if we forget to adjust some other headers. For example, the sec-ch-ua header usually sends a version of the browser, and it must match the user-agent's one: "Google Chrome";v="96". But some older versions do not send that header at all, so sending it might also be suspicious.
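
A rough sketch of that consistency check: compare the Chrome version in the user-agent with the versions announced in sec-ch-ua. The function name is hypothetical, and the check is deliberately simple. It only catches mismatches between these two headers, not a missing sec-ch-ua on a browser that should send one.

```python
import re


def headers_look_consistent(headers):
    # Header names are case-insensitive, so normalize them first
    h = {name.lower(): value for name, value in headers.items()}
    sec_ch_ua = h.get("sec-ch-ua")
    if sec_ch_ua is None:
        return True  # nothing to contradict the User-Agent in this simple check
    chrome = re.search(r"Chrome/(\d+)", h.get("user-agent", ""))
    if chrome is None:
        return False  # a non-Chrome User-Agent paired with a Chrome-style hint
    return chrome.group(1) in re.findall(r'v="(\d+)"', sec_ch_ua)


chrome_93 = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
    "Sec-Ch-Ua": '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
}
mismatched = {**chrome_93, "Sec-Ch-Ua": '"Google Chrome";v="96"'}

print(headers_look_consistent(chrome_93))   # True
print(headers_look_consistent(mismatched))  # False
```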

The problem is Selenium does not support adding headers. A third-party solution like Selenium Wire might solve it. Install it with pip install selenium-wire.

It will allow us to intercept requests, among other things, and modify the headers we want or add new ones. When changing, we must delete the original one first to avoid sending duplicates.

from seleniumwire import webdriver
from selenium.webdriver.common.by import By

url = "http://httpbin.org/anything"
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
sec_ch_ua = '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"'
referer = 'https://www.google.com'

options = webdriver.ChromeOptions()
options.add_argument("--headless")

def interceptor(request):
    new_headers = {'user-agent': user_agent, 'sec-ch-ua': sec_ch_ua, 'referer': referer}
    for name, value in new_headers.items():
        del request.headers[name]  # Delete the header first to avoid duplicates
        request.headers[name] = value

with webdriver.Chrome(options=options) as driver:
    driver.request_interceptor = interceptor
    driver.get(url)
    print(driver.find_element(By.TAG_NAME, "body").text)

Proxy to change the IP

As with the headers, Selenium has limited support for proxies. We can add a proxy without authentication as a driver option. For testing, we'll use Free Proxies although they are not reliable, and the one below probably won't work for you at all. They are usually short-lived.

from selenium import webdriver
# ...
url = "http://httpbin.org/ip"
proxy = '85.159.48.170:40014' # free proxy
options.add_argument('--proxy-server=%s' % proxy)

with webdriver.Chrome(options=options) as driver:
    driver.get(url)
    print(driver.find_element(By.TAG_NAME, "body").text) # "origin": "85.159.48.170"

For more complex solutions or ones in need of auth, Selenium Wire can help us again. We need a second set of options in this case, where we will add the proxy server we want to use.


proxy_pass = "YOUR_API_KEY"
seleniumwire_options = {
    'proxy': {
        "http": f"http://{proxy_pass}:@proxy.zenrows.com:8001",
        'verify_ssl': False,
    },
}

with webdriver.Chrome(options=options,
        seleniumwire_options=seleniumwire_options) as driver:
    driver.get(url)
    print(driver.find_element(By.TAG_NAME, "body").text)

For proxy servers that don't rotate IPs automatically, driver.proxy can be overwritten. From that point on, all requests will use the new proxy. This action can be done as many times as necessary. For convenience and reliability, we advocate for Smart Rotating Proxies.

#...
    driver.get(url)  # Initial proxy
    driver.proxy = {
        'http': 'http://user:pass@1.2.3.4:5678',
    }
    driver.get(url)  # New proxy

Blocking Resources

For performance, saving bandwidth, or avoiding tracking, blocking some resources might prove crucial when scaling scraping.

from selenium import webdriver

url = "https://www.amazon.com/"

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.experimental_options["prefs"] = {
    "profile.managed_default_content_settings.images": 2
}

with webdriver.Chrome(options=options) as driver:
    driver.get(url)
    driver.save_screenshot('amazon_without_images.png')

We could even go a step further and avoid loading almost any type. Careful with this since blocking Javascript would mean no AJAX calls, for example.

options.experimental_options["prefs"] = {
    "profile.managed_default_content_settings.images": 2,
    "profile.managed_default_content_settings.stylesheets": 2,
    "profile.managed_default_content_settings.javascript": 2,
    "profile.managed_default_content_settings.cookies": 2,
    "profile.managed_default_content_settings.geolocation": 2,
    "profile.default_content_setting_values.notifications": 2,
}

Intercepting Requests

Once again, thanks to Selenium Wire, we can decide programmatically what to do with each request. It means that we can effectively block some images while allowing others. We can also block domains using exclude_hosts or allow only specific requests based on URLs matching against a regular expression with driver.scopes.

def interceptor(request):
    # Block PNG and GIF images, will show JPEG for example
    if request.path.endswith(('.png', '.gif')):
        request.abort()

with webdriver.Chrome(options=options) as driver:
    driver.request_interceptor = interceptor
    driver.get(url)

Execute Javascript

The last Selenium feature we want to mention is executing Javascript. Some things are easier done directly in the browser, or we want to check that something worked correctly. We can call execute_script, passing the JS code we want to be executed. It can go without params or with elements as params.

We can see both cases in the examples below. There is no need for params to get the User-Agent as the browser sees it. That might prove helpful to check that the one sent is being modified correctly in the navigator object since some security checks might raise red flags otherwise. The second one will take an h2 as an argument and return its left position by accessing getClientRects.

with webdriver.Chrome(options=options) as driver:
    driver.get(url)

    agent = driver.execute_script("return navigator.userAgent")
    print(agent) # Mozilla/5.0  ... Chrome/96 ...

    header = driver.find_element(By.CSS_SELECTOR, "h2")
    header_left = driver.execute_script(
        'return arguments[0].getClientRects()[0].left', header)
    print(header_left) # 242.5

Conclusion

Selenium is a valuable tool with many applications, but you have to take advantage of them in your way. Apply each feature in your favor. And many times, there are several ways of arriving at the same point; look for the one that helps you the most - or the easiest one.

Once you get the hang of it, you'll want to grow your scraping and get more pages. That is where other challenges might appear: crawling at scale and blocks. Some tips above will help you: check the headers and proxy sections. But also be aware that crawling at scale is not an easy task. Please don't say we didn't warn you 😅

I hope you leave with an understanding of how Selenium works in Python (it goes the same for other languages). An important topic that we did not cover is when Selenium is necessary. Because many times you can save time, bandwidth, and server performance by scraping without a browser. Or you can contact us, and we'll be delighted to help you crawl, scrape and scale whatever you need!

Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈

Originally published at https://www.zenrows.com

Web Scraping: Intercepting XHR Requests

Ander Rodriguez — Wed, 27 Oct 2021 13:25:39 +0000

Have you ever tried scraping AJAX websites? Sites full of Javascript and XHR calls? Deciphering tons of nested CSS selectors? Or worse, selectors that change daily? Maybe you won't need that ever again. Keep on reading, XHR scraping might prove your ultimate solution!

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install Playwright and the browser binaries for Chromium, Firefox, and WebKit.



pip install playwright
playwright install

Intercept Responses

As we saw in a previous blog post about blocking resources, headless browsers allow request and response inspection. We will use Playwright in Python for the demo, but it can be done in Javascript or using Puppeteer.

We can quickly inspect all the responses on a page. As we can see below, the response parameter contains the status, URL, and content itself. And that's what we'll be using instead of directly scraping content in the HTML using CSS selectors.



page.on("response", lambda response: print(
    "<<", response.status, response.url))

Use case: auction.com

Our first example will be auction.com. You might need proxies or a VPN since it blocks traffic from outside the countries it operates in. Anyway, it might be a problem trying to scrape from your IP since they will ban it eventually. Check out how to avoid blocking if you find any issues.

Here is a basic example of loading the page using Playwright while logging all the responses.



from playwright.sync_api import sync_playwright

url = "https://www.auction.com/residential/ca/"

with sync_playwright() as p:
    browser = p.firefox.launch()
    page = browser.new_page()

    page.on("response", lambda response: print(
        "<<", response.status, response.url))
    page.goto(url, wait_until="networkidle", timeout=90000)

    print(page.content())

    page.context.close()
    browser.close()

auction.com will load an HTML skeleton without the content we are after (house prices or auction dates). They will then load several resources such as images, CSS, fonts, and Javascript. If we wanted to save some bandwidth, we could filter out some of those. For now, we're going to focus on the attractive parts.

As we can see in the network tab, almost all relevant content comes from an XHR call to an assets endpoint. Ignoring the rest, we can inspect that call by checking that the response URL contains this string: if ("v1/search/assets?" in response.url).

There is a size and time problem: the page will load tracking and map, which will amount to more than a minute in loading (using proxies) and 130 requests :O. We could do better by blocking certain domains and resources. We were able to do it in under 20 seconds with only 7 loaded resources in our tests. We will leave that as an exercise for you ;)



<< 407 https://www.auction.com/residential/ca/ 
<< 200 https://www.auction.com/residential/ca/ 
<< 200 https://cdn.auction.com/residential/page-assets/styles.d5079a39f6.prod.css 
<< 200 https://cdn.auction.com/residential/page-assets/framework.b3b944740c.prod.js 
<< 200 https://cdn.cookielaw.org/scripttemplates/otSDKStub.js 
<< 200 https://static.hotjar.com/c/hotjar-45084.js?sv=5 
<< 200 https://adc-tenbox-prod.imgix.net/resi/propertyImages/no_image_available.v1.jpg 
<< 200 https://cdn.mlhdocs.com/rcp_files/auctions/E-19200/photos/thumbnails/2985798-1-G_bigThumb.jpg 
# ...
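
As a starting point for the blocking exercise, here is a sketch of the decision function only. The domain list is taken from the log above, and the resource types follow Playwright's request.resource_type names. Which ones are safe to block on a given site is an assumption you would need to verify.

```python
from urllib.parse import urlparse

# Assumptions for illustration: tracking and consent domains seen in the log
BLOCKED_DOMAINS = {"static.hotjar.com", "cdn.cookielaw.org"}
BLOCKED_TYPES = {"image", "font", "media"}


def should_block(url, resource_type):
    host = urlparse(url).hostname or ""
    return host in BLOCKED_DOMAINS or resource_type in BLOCKED_TYPES


# With Playwright, this could be wired up before page.goto():
#   page.route("**/*", lambda route: route.abort()
#       if should_block(route.request.url, route.request.resource_type)
#       else route.continue_())

print(should_block("https://static.hotjar.com/c/hotjar-45084.js?sv=5", "script"))  # True
print(should_block("https://www.auction.com/residential/ca/", "document"))         # False
```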

For a more straightforward solution, we decided to change to the wait_for_selector function. It is not the ideal solution, but we noticed that sometimes the script stops altogether before loading the content. To avoid those cases, we change the waiting method.

While inspecting the results, we saw that the wrapper was there from the skeleton. But each house's content is not. So we will wait for one of those: "h4[data-elm-id]".



with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are interested in
        if ("v1/search/assets?" in response.url):
            print(response.json()['result']['assets']['asset'])
    # ...
    page.on("response", handle_response)
    # really long timeout since it gets stuck sometimes
    page.goto(url, timeout=120000)
    page.wait_for_selector("h4[data-elm-id]", timeout=120000)

Here we have the output, with even more info than the interface offers! Everything is clean and nicely formatted 😎



[
  {
    "item_id": "E192003",
    "global_property_id": 2981226,
    "property_id": 5444765,
    "property_address": "13841 COBBLESTONE CT",
    "property_city": "FONTANA",
    "property_county": "San Bernardino",
    "property_state": "CA",
    "property_zip": "92335",
    "property_type": "SFR",
    "seller_code": "FSH",
    "beds": 4,
    "baths": 3,
    "sqft": 1704,
    "lot_size": 0.2,
    "latitude": 34.10391,
    "longitude": -117.50212,
    ...

We could go a step further and use the pagination to get the whole list, but we'll leave that to you.

Use case: twitter.com

Another typical case where there is no initial content is Twitter. To be able to scrape Twitter, you will undoubtedly need Javascript Rendering. As in the previous case, you could use CSS selectors once the entire content is loaded. But beware, since Twitter classes are dynamic and they will change frequently.

What will most probably remain the same is the API endpoint they use internally to get the main content: TweetDetail. In cases like this one, the easiest path is to check the XHR calls in the network tab in devTools and look for some content in each request. It is an excellent example because Twitter can make 20 to 30 JSON or XHR requests per page view.

Once we identify the calls and the responses we are interested in, the process will be similar.



import json
from playwright.sync_api import sync_playwright

url = "https://twitter.com/playwrightweb/status/1396888644019884033"

with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are interested in
        if ("/TweetDetail?" in response.url):
            print(json.dumps(response.json()))

    browser = p.firefox.launch()
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")
    page.context.close()
    browser.close()

The output will be a considerable JSON (80kb) with more content than we asked for. More than ten nested structures until we arrive at the tweet content. The good news is that we can now access favorite, retweet, or reply counts, images, dates, reply tweets with their content, and many more.
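With that many nested levels, a small recursive helper can save some typing. The sketch below searches a parsed JSON structure for every occurrence of a key. The sample payload is made up and only mimics the nesting; the key names full_text and favorite_count are assumptions, so check them against the real response before relying on them.

```python
def find_values(node, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending: the same key may appear deeper as well.
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Made-up payload that only mimics the deep nesting; not the real response.
sample = {"data": {"entries": [
    {"content": {"tweet": {"full_text": "Hello", "favorite_count": 3}}},
    {"content": {"tweet": {"full_text": "World", "favorite_count": 5}}},
]}}

print(list(find_values(sample, "full_text")))
# ['Hello', 'World']
```

Inside handle_response, it could be called as find_values(response.json(), "full_text").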

Use case: nseindia.com

Stock markets are an ever-changing source of essential data. Some sites offering this info, such as the National Stock Exchange of India, will start with an empty skeleton. After browsing for a few minutes on the site, we see that the market data loads via XHR.

Another common clue is to view the page source and check for content there. If it's not there, it usually means that it will load later, which probably requires XHR requests. And we can intercept those!

Since we are parsing a list, we will loop over it and print only part of the data in a structured way: symbol and price for each entry.



from playwright.sync_api import sync_playwright

url = "https://www.nseindia.com/market-data/live-equity-market"

with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are interested in
        if ("equity-stockIndices?" in response.url):
            # loop over the list and print symbol and price for each entry
            for item in response.json()['data']:
                print(item['symbol'], item['lastPrice'])

    browser = p.firefox.launch()
    page = browser.new_page()

    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")

    page.context.close()
    browser.close()

# Output:
# NIFTY 50 18125.4
# ICICIBANK 846.75
# AXISBANK 845
# ...

As in the previous examples, this is a simplified example. Printing is not the solution to a real-world problem. Instead, each page structure should have a content extractor and a method to store it. And the system should also handle the crawling part independently.
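To make that separation a bit more concrete, here is a minimal sketch of an extractor and a storage step for the stock example. It assumes the payload keeps the data, symbol, and lastPrice keys used above; extract_quotes and store_quotes are hypothetical names, and CSV is just one possible storage choice.

```python
import csv
import io


def extract_quotes(payload):
    """Content extractor: keep only symbol and price from the XHR payload."""
    return [
        {"symbol": item["symbol"], "price": item["lastPrice"]}
        for item in payload.get("data", [])
    ]


def store_quotes(rows, file_obj):
    """Storage step: write the rows as CSV to any file-like object."""
    writer = csv.DictWriter(file_obj, fieldnames=["symbol", "price"])
    writer.writeheader()
    writer.writerows(rows)


# Payload shaped like the response handled above, with values from the output.
payload = {"data": [
    {"symbol": "NIFTY 50", "lastPrice": 18125.4},
    {"symbol": "ICICIBANK", "lastPrice": 846.75},
]}
buffer = io.StringIO()
store_quotes(extract_quotes(payload), buffer)
print(buffer.getvalue())
```

In the response handler, the print call would be replaced by extract_quotes(response.json()), with the rows written to a real file or database.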

Conclusion

We'd like you to leave with three main points:

Inspect the page looking for clean data
API endpoints change less often than CSS selectors and HTML structure
Playwright offers more than just Javascript rendering

Even if the extracted data is the same, fail-tolerance and effort in writing the scraper are fundamental factors. The less you have to change them manually, the better.

Apart from XHR requests, there are many other ways to scrape data beyond selectors. Not every one of them will work on a given website, but adding them to your toolbelt might help you often.

Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈

Originally published at https://www.zenrows.com

Blocking Resources with Playwright

Ander Rodriguez — Wed, 29 Sep 2021 13:35:11 +0000

Did you know that Playwright allows you to block requests and thus speed up your scraping or testing? You can block certain resource types like images, any request by domain, or filter in many other ways.

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install Playwright and the browser binaries for Chromium, Firefox, and WebKit.



pip install playwright
playwright install

Intro to Playwright

Playwright "is a Python library to automate Chromium, Firefox, and WebKit browsers with a single API." It allows us to browse the Internet with a headless browser programmatically.

Playwright is also available for Node.js, and everything shown below can be done with a similar syntax. Check the docs for more details.

Here's how to start a browser (i.e., Chromium) in a few lines, navigate to a page, and get its title.



from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.zenrows.com/")
    print(page.title())
    # Web Scraping API & Data Extraction - ZenRows
    page.context.close()
    browser.close()

Logging Network Events

Subscribe to events such as request or response and log their content to see what is happening. Since we did not tell Playwright otherwise, it will load the entire page: HTML, CSS, execute Javascript, get images, and so on. Add these two lines before requesting the page to see what's going on.



page.on("request", lambda request: print(
    ">>", request.method, request.url,
    request.resource_type))
page.on("response", lambda response: print(
    "<<", response.status, response.url))

page.goto("https://www.zenrows.com/")

# >> GET https://www.zenrows.com/ document
# << 200 https://www.zenrows.com/
# >> GET https://cdn.zenrows.com/images_dash/logo-instagram.svg image
# << 200 https://cdn.zenrows.com/images_dash/logo-instagram.svg

The entire output is almost 50 lines long, with 24 resource requests. We probably don't need most of those for website scraping, so we will see how to block them and save time and bandwidth.

Blocking Resources

Why load resources and content that we won't use? Learn how to avoid unnecessary data and network requests with these techniques.

Block by Glob Pattern

page also exposes a method route that will execute a handler for each matching route or pattern. Let's say that we don't want SVGs to load. Using a pattern like "**/*.svg" will match requests ending with that extension. As for the handler, we need no logic for the moment, only to abort the request. For that, we'll use a lambda and the route param's abort method.



page.route("**/*.svg", lambda route: route.abort())
page.goto("https://www.zenrows.com/")

Note: according to the official documentation, patterns like "**/*.{png,jpg,jpeg}" should work, but we found otherwise. Anyway, it's doable with the next blocking strategy.

Block by Regex

If (for some reason 😜) you are into Regex, feel free to use them. But compiling them first is mandatory. We will block three image extensions in this case. Regex are tricky, but they offer a ton of flexibility.



import re
# ...
    page.route(re.compile(r"\.(jpg|png|svg)$"),
        lambda route: route.abort())
    page.goto("https://www.zenrows.com/")

Now there are 23 requests and only 15 responses. We saved 8 images from being downloaded!
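One caveat worth checking with this pattern: the $ anchor requires the extension to be the very end of the URL, so an image served with a query string would slip through. The sketch below compares the original pattern with a more lenient variant. The variant is our assumption, not something from the Playwright docs, and the example.com URLs are hypothetical. We use search here on the assumption that Playwright looks for the pattern anywhere in the URL, which is consistent with the snippet above working without a leading wildcard.

```python
import re

# Original pattern: the extension must be the very end of the URL.
strict = re.compile(r"\.(jpg|png|svg)$")
# Variant (an assumption, not from the docs): also accept "?query" or "#hash".
lenient = re.compile(r"\.(jpg|png|svg)(\?.*|#.*)?$")

urls = [
    "https://cdn.zenrows.com/images_dash/logo-instagram.svg",
    "https://example.com/photo.jpg?width=200",  # hypothetical URL
    "https://example.com/script.js",            # hypothetical URL
]
for url in urls:
    # Prints whether each pattern would block the URL.
    print(bool(strict.search(url)), bool(lenient.search(url)), url)
```

The lenient pattern can be passed to page.route in the same way as the original one.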

Block by Resource Type

But what happens if they use "jpeg" extension instead of "jpg"? Or avif, gif, webp? Should we maintain an updated list?

Luckily for us, the route param exposed in the lambda function above includes the original request and resource type. And one of those types is image, perfect! You can access the whole resource type list.

We'll now match every request ("**/*") and add conditional logic to the lambda function. In case it is an image, abort the request as before. Else, continue with it as usual.



page.route("**/*", lambda route: route.abort()
    if route.request.resource_type == "image"
    else route.continue_()
)
page.goto("https://www.zenrows.com/")

Take into consideration that some trackers use images. It is probably not a big deal when scraping or testing, but just in case.

Function handler

We can also define functions for the handlers instead of using lambdas. That comes in handy in case we need to reuse it or it grows past a single conditional.

Suppose that we want to block aggressively now. Looking at the output from the previous runs will show a list of the used resources. We'll add those to a list and then check if the type is in that list.



excluded_resource_types = ["stylesheet", "script", "image", "font"]
def block_aggressively(route):
    if (route.request.resource_type in excluded_resource_types):
        route.abort()
    else:
        route.continue_()
# ...
    page.route("**/*", block_aggressively)
    page.goto("https://www.zenrows.com/")

We are entirely in control now, and the versatility is absolute. From route.request, the original URL, the headers, and several other pieces of information are available.

Being even more strict: block everything that is not document type. That will effectively prevent anything but the initial HTML from being loaded.



def block_aggressively(route):
    if (route.request.resource_type != "document"):
        route.abort()
    else:
        route.continue_()

There is a single response now! We got the HTML without any other resource being downloaded. We sure have saved a lot of time and bandwidth, right? But... how much exactly?

Measuring Performance Boost

We can only say that it got better if we could measure the differences. We will take a look at three approaches. But just running a script with several URLs will do. Spoiler: we did that for 10 URLs, 1.3 seconds VS 8.4.
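The "just run a script with several URLs" approach could look roughly like the sketch below. It is not the script behind the numbers above: time_visits is a hypothetical helper, and fake_visit stands in for page.goto so the example runs without a browser.

```python
import time


def time_visits(visit, urls):
    """Call visit(url) for every URL and return the total elapsed seconds."""
    start = time.perf_counter()
    for url in urls:
        visit(url)
    return time.perf_counter() - start


def fake_visit(url):
    """Stand-in for page.goto(url), so the sketch runs without a browser."""
    time.sleep(0.01)


urls = ["https://www.zenrows.com/"] * 3
print(f"{time_visits(fake_visit, urls):.2f} seconds")
```

With Playwright, passing page.goto as the visit argument, once with a blocking route registered and once without, gives the two totals to compare.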

HAR Files

For those of you used to checking the DevTools Network tab, we have good news! Playwright allows HAR recording by providing an extra parameter in the new_page method. As easy as that.



page = browser.new_page(record_har_path="playwright_test.har")
page.goto("https://www.zenrows.com/")

There are some HAR visualizers out there, but the easiest way is to use Chrome DevTools. Open the Network tab and click on the import button or drag&drop the HAR file.

Check time! Below is the comparison between two different HAR files. The first one is without blocking (regular navigation). The second one is blocking everything except for the initial document.

Almost every resource has a "-1" Status and "Pending" Time on the blocking side. That's DevTools's way of telling us that those were blocked and not downloaded. We can see clearly on the bottom left that we performed fewer requests, and the transferred data amount is a fraction of the original! From 524kB to 17.4kB, a 96% cut.
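The same comparison can be scripted, since a HAR file is plain JSON. Here is a minimal sketch that counts entries and adds up response body sizes. The sample data is hand-written with made-up numbers, summarize_har is a hypothetical helper, and treating a non-positive status as "blocked" is an assumption based on the "-1" we saw in DevTools.

```python
import json


def summarize_har(har):
    """Count requests and add up response body sizes from a HAR dictionary."""
    entries = har["log"]["entries"]
    # HAR uses -1 for unknown sizes, so clamp those to zero.
    total_bytes = sum(max(e["response"].get("bodySize", 0), 0) for e in entries)
    # Assumption: blocked requests are recorded with a non-positive status.
    blocked = sum(1 for e in entries if e["response"].get("status", 0) <= 0)
    return {"requests": len(entries), "blocked": blocked, "bytes": total_bytes}


# Tiny hand-written HAR with made-up numbers, only to show the structure.
sample_har = {"log": {"entries": [
    {"request": {"url": "https://www.zenrows.com/"},
     "response": {"status": 200, "bodySize": 17400}},
    {"request": {"url": "https://cdn.zenrows.com/images_dash/logo-instagram.svg"},
     "response": {"status": -1, "bodySize": -1}},
]}}
print(summarize_har(sample_har))
# {'requests': 2, 'blocked': 1, 'bytes': 17400}

# With a recorded file (path taken from the snippet above):
# with open("playwright_test.har") as f:
#     print(summarize_har(json.load(f)))
```

Keep in mind that Playwright writes the HAR file when the context is closed, so read it after page.context.close().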

Browser's Performance API

The browsers offer an interface to check the performance that shows how it went for things like timing. Playwright can evaluate Javascript, so we'll use it to print those results.

The output will be a JSON object with a lot of timestamps. The most straightforward check is to get the difference between navigationStart and loadEventEnd. When blocking, it should be under half a second (e.g., 346ms); for regular navigation, above a second or even two (e.g., 1363ms).



page.goto("https://www.zenrows.com/")
print(page.evaluate("JSON.stringify(window.performance)"))
# {"timing":{"connectStart":1632902378272,"navigationStart":1632902378244, ...

As you can see, blocking can be a second faster, even more for slower sites. The less you download, the faster you can scrape!
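The difference between the two timestamps can be computed directly from that JSON string. In this short sketch, load_time_ms is a hypothetical helper; navigationStart comes from the output above, while loadEventEnd is a made-up value chosen to reproduce the 346ms mentioned for the blocking run.

```python
import json


def load_time_ms(performance_json):
    """Milliseconds between navigationStart and loadEventEnd."""
    timing = json.loads(performance_json)["timing"]
    return timing["loadEventEnd"] - timing["navigationStart"]


# navigationStart comes from the output above; loadEventEnd is made up.
raw = '{"timing": {"navigationStart": 1632902378244, "loadEventEnd": 1632902378590}}'
print(load_time_ms(raw), "ms")
# 346 ms
```

With a live page, the argument would be page.evaluate("JSON.stringify(window.performance)").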

CDP Session

Going a step further, we connect directly with Chrome DevTools Protocol. Playwright creates a CDP Session for us to extract, for example, performance metrics.

We have to create a client from the page context and start communication with CDP. In our case, we will enable "Performance" before visiting the page and getting the metrics after it.

The output will be a JSON-like string with interesting values such as nodes, process time, JS Heap used, and many more.



client = page.context.new_cdp_session(page)
client.send("Performance.enable")
page.goto("https://www.zenrows.com/")
print(client.send("Performance.getMetrics"))
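The response wraps the metrics in a list of name/value pairs, which is awkward to query. A small sketch to flatten it into a dictionary follows; metrics_to_dict is a hypothetical helper and the sample values are made up.

```python
def metrics_to_dict(response):
    """Flatten the name/value pairs returned by Performance.getMetrics."""
    return {metric["name"]: metric["value"] for metric in response["metrics"]}


# Made-up values in the shape CDP uses: {"metrics": [{"name": ..., "value": ...}]}
sample = {"metrics": [
    {"name": "Nodes", "value": 1204},
    {"name": "JSHeapUsedSize", "value": 3581288},
]}
metrics = metrics_to_dict(sample)
print(metrics["Nodes"], metrics["JSHeapUsedSize"])
# 1204 3581288
```

In the snippet above, it would be called as metrics_to_dict(client.send("Performance.getMetrics")).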

Conclusion

We'd like you to leave with three main points:

Load only needed resources.
Save time and bandwidth when possible.
Measure your efforts and performance before scaling up.

Let's not forget that website scraping is a process with multiple steps, one of them being rotating proxies. Those add processing time and, sometimes, charge per bandwidth.

You can achieve precisely the same results while saving time/bandwidth/money. Many times, images or CSS are only overhead. Other times, there is no need for JS if you only want the initial static content.


Originally published at https://www.zenrows.com

Web Scraping with Javascript and Node.js

Ander Rodriguez — Wed, 01 Sep 2021 14:51:35 +0000

Javascript and web scraping are both on the rise. We will combine them to build a simple scraper and crawler from scratch using Javascript in Node.js.

Avoiding blocks is an essential part of website scraping, so we will also add some features to help in that regard. And finally, parallelize the tasks to go faster thanks to Node's event loop.

Prerequisites

For the code to work, you will need Node (or nvm) and npm installed. Some systems have it pre-installed. After that, install all the necessary libraries by running npm install.

npm install axios cheerio playwright

Introduction

We are using Node v12, but you can always check the compatibility of each feature.

Axios is a "promise based HTTP client" that we will use to get the HTML from a URL. It allows several options such as headers and proxies, which we will cover later. If you use TypeScript, they "include TypeScript definitions and a type guard for Axios errors."

Cheerio is a "fast, flexible & lean implementation of core jQuery." It lets us find nodes with selectors, get text or attributes, and many other things. We will pass the HTML to cheerio and then query it as we would in a browser environment.

Playwright "is a Node.js library to automate Chromium, Firefox and WebKit with a single API." When Axios is not enough, we will get the HTML using a headless browser to execute Javascript and wait for the async content to load.

Scraping the Basics

The first thing we need is the HTML. We installed Axios for that, and its usage is straightforward. We'll use scrapeme.live as an example, a fake website prepared for scraping.

Nice! Then, using cheerio, we can query for the two things we want right now: paginator links and products. To know how to do that, we will look at the page with Chrome DevTools open. All modern browsers offer developer tools such as these. Pick your favorite.

We marked the interesting parts in red, but you can go on your own and try it yourselves. In this case, all the CSS selectors are straightforward and do not need nesting. Check the guide if you are looking for a different outcome or cannot select it. You can also use DevTools to get the selector.

On the Elements tab, right-click on the node ➡ Copy ➡ Copy selector.
But the outcome is usually very coupled to the HTML, as in this case: #main > div:nth-child(2) > nav > ul > li:nth-child(2) > a. This approach might be a problem in the future because it will stop working after any minimal change. Besides, it will only capture one of the pagination links, not all of them.

We could capture all the links on the page and then filter them by content. If we were to write a full-site crawler, that would be the right approach. In our case, we only want the pagination links. Using the provided class, .page-numbers a will capture all and then extract the URLs (hrefs) from those. The selector will match all the link nodes with an ancestor containing the class page-numbers.

As for the products (Pokémon in this case), we will get id, name, and price. Check the image below for details on selectors, or try again on your own. We will only log the content for now. Check the final code for adding them to an array.

As you can see above, all the products contain the class product, which makes our job easier. And for each of them, the h2 tag and price node hold the content we want. As for the product ID, we need to match an attribute instead of a class or node type. That can be done using the syntax node[attribute="value"]. We are looking only for the node with the attribute, so there is no need to match it to any particular value.

There is no error handling, as you can see above. We will omit it for brevity in the snippets but take it into account in real life. Most of the time, returning the default value (e.g., an empty array) should do the trick.

Following Links

Now that we have some pagination links, we should also visit them. If you run the whole code, you'll see that they appear twice - there are two pagination bars.

We will add two sets to keep track of what we already visited and the newly discovered links. We are using sets instead of arrays to avoid dealing with duplicates, but either one would work. To avoid crawling too much, we'll also include a maximum.

For the next part, we will use async/await to avoid callbacks and nesting. An async function is an alternative to writing promise-based functions as chains. In this case, the Axios call will remain asynchronous. It might take around 1 second per page, but we write the code sequentially, with no need for callbacks.

There is a small gotcha with this: await is only valid inside an async function. That will force us to wrap the initial code inside a function, concretely in an IIFE (Immediately Invoked Function Expression). The syntax is a bit weird. It creates a function and then calls it immediately.

Avoid Blocks

As said before, we need mechanisms to avoid blocks, captchas, login walls, and several other defensive techniques. It is complicated to prevent them 100% of the time. But we can achieve a high success rate with simple efforts. We will apply two tactics: adding proxies and full-set headers.

There are Free Proxies even though we do not recommend them. They might work for testing but are not reliable. We can use some of those for testing, as we'll see in some examples.
Note that these free proxies might not work for you. They are short-lived.

Paid proxy services, on the other hand, offer IP Rotation. Meaning that our service will work the same, but the target website will see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they are much harder to ban. And when it happens, we'll get a new IP after a short time.

We will use httpbin for testing. It offers several endpoints that will respond with headers, IP addresses, and many more.

The next step would be to check our request headers. The most known one is User-Agent (UA for short), but there are many more. Many software tools have their own, for example, Axios (axios/0.21.1). In general, it is a good practice to send actual headers along with the UA. That means we need a real-world set of headers because not all browsers and versions use the same ones. We include two in the snippet: Chrome 92 and Firefox 90 in a Linux machine.

Headless Browsers

Until now, every page visited was done using axios.get, which can be inadequate in some cases. Say we need Javascript to load and execute or interact in any way with the browser (via mouse or keyboard). While avoiding them would be preferable - for performance reasons -, sometimes there is no other choice. Selenium, Puppeteer, and Playwright are the most used and known libraries. The snippet below shows only the User-Agent, but since it is a real browser, the headers will include the entire set (Accept, Accept-Encoding, etcetera).

This approach comes with its own problem: take a look at the User-Agents. The Chromium one includes "HeadlessChrome," which will tell the target website, well, that it is a headless browser. They might act upon that.

As with Axios, we can provide extra headers, proxies, and many other options to customize every request. An excellent choice to hide our "HeadlessChrome" User-Agent. And since this is a real browser, we can intercept requests, block others (like CSS files or images), take screenshots or videos, and more.

Now we can separate getting the HTML in a couple of functions, one using Playwright and the other Axios. We would then need a way to select which one is appropriate for the case at hand. For now, it is hardcoded. The output, by the way, is the same but quite faster when using Axios.

Using Javascript's Async

We already introduced async/await when crawling several links sequentially. If we were to crawl them in parallel, just removing the await would be enough, right? Well... not so fast.

The function would call the first crawl and immediately take the following item from the toVisit set. The problem is that the set is empty since the crawling of the first page didn't occur yet. So we added no new links to the list. The function keeps running in the background, but we already exited from the main one.

To do this properly, we need to create a queue that will execute tasks when available. To avoid many requests at the same time, we will limit its concurrency.

If you run the code above, it will print numbers from 0 to 3 almost immediately (with a timestamp) and from 4 to 7 after 2 seconds. It might be the hardest snippet to understand, so take your time reviewing it.

We define queue in lines 1-20. It will return an object with the function enqueue to add a task to the list. Then it checks if we are above the concurrency limit. If we are not, it will add one to running and enter a loop that gets a task and runs it with the provided params until the task list is empty, and then subtract one from running. This variable is the one that marks when we can or cannot execute any more tasks, only allowing it below the concurrency limit. In lines 23-28, there are helper functions sleep and printer. Instantiate the queue in line 30 and enqueue items in 32-34 (which will start running 4).

We have to use the queue now instead of a for loop to run several pages concurrently. The code below is partial with the parts that change.

Remember that Node runs in a single thread, so we can take advantage of its event loop but cannot use multiple CPUs/threads. What we've seen works fine because the thread is idle most of the time - network requests do not consume CPU time.

To build this further, we need to use some storage (database) or distributed queue system. Right now, we rely on variables, which are not shared between threads in Node. It is not overly complicated, but we covered enough ground in this blog post.

Final Code

Conclusion

We'd like you to leave with four main points:

Understand the basics of website parsing and crawling.
Separate responsibilities and use abstractions when necessary.
Apply the required techniques to avoid blocks.
Be able to figure out the following steps to scale up.

We can build a custom web scraper using Javascript and Node.js using the pieces we've seen. It might not scale to thousands of websites, but it will run perfectly for a few ones. And moving to distributed crawling is not that far from here.

If you liked it, you might be interested in the Python Web Scraping guide.


Originally published at https://www.zenrows.com

Mastering Web Scraping in Python: Scaling to Distributed Crawling

Ander Rodriguez — Wed, 25 Aug 2021 13:35:12 +0000

Wondering how to build a website crawler and parser at scale? Implement a project to crawl, scrape, extract content, and store it at scale in a distributed and fault-tolerant manner. We will take all the knowledge from previous posts and combine it.

First, we learned about pro techniques to scrape content, although we'll only use CSS selectors today. Then tricks to avoid blocks, from which we will add proxies, headers, and headless browsers. And lastly, we built a parallel crawler, and this blog post begins with that code.

If you do not understand some part or snippet, it might be in an earlier post. Brace yourselves; lengthy snippets are coming.

Prerequisites

For the code to work, you will need Redis and python3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install.

pip install requests beautifulsoup4 playwright "celery[redis]"
playwright install

Intro to Celery and Redis

Celery "is an open source asynchronous task queue." We created a simple parallel version in the last blog post. Celery takes it a step further by providing an actual distributed queue implementation. We will use it to distribute our load among workers and servers.

Redis "is an open source, in-memory data structure store, used as a database, cache, and message broker." Instead of using arrays and sets to store all the content (in memory), we will use Redis as a database. Moreover, Celery can use Redis as a broker, so we won't need other software to run it.

Simple Celery Task

Our first step will be to create a task in Celery that prints the value received as a parameter. Save the snippet in a file called tasks.py and run it. If you run it as a regular python file, only one string will be printed. The console will print two different lines if you run it with celery -A tasks worker.

The difference is in the demo function call. Direct call implies "execute that task," while delay means "enqueue it for a worker to process." Check the docs for more info on calling tasks.

The celery command will not end; we need to kill it by exiting the console (i.e., ctrl + C). We'll need it several times because Celery does not reload after code changes.

Crawling from Task

The next step is to connect a Celery task with the crawling process. This time we will be using a slightly altered version of the helper functions seen in the last post. extract_links will get all the links on the page except the nofollow ones. We will add filtering options later.
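As a rough idea of what extract_links does, here is a dependency-free sketch. It swaps beautifulsoup4 for the standard library's html.parser so it runs anywhere, which means it is not the helper from the previous post; the (url, html) signature and the LinkParser class are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect href values from <a> tags that are not marked nofollow."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated tokens, or be missing.
        rel = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if href and "nofollow" not in rel:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(url, html):
    """Return absolute URLs for every followable link in the HTML."""
    parser = LinkParser(url)
    parser.feed(html)
    return parser.links


html = '<a href="/page/2/">2</a><a rel="nofollow" href="/cart/">Cart</a>'
print(extract_links("https://scrapeme.live/shop/", html))
# ['https://scrapeme.live/page/2/']
```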

We could loop over the retrieved links and enqueue them, but that would end up crawling the same pages repeatedly. We saw the basics to execute tasks, and now we will start splitting into files and keeping track of the pages on Redis.

Redis for Tracking URLs

We already said that relying on memory variables is not an option anymore. We will need to persist all that data: visited pages, the ones being currently crawled, keep a "to visit" list, and store some content later on. For all that, instead of enqueuing directly to Celery, we will use Redis to avoid re-crawling and duplicates. And enqueue URLs only once.

We won't go into further details on Redis, but we will use lists, sets, and hashes.

Take the last snippet and remove the last two lines, the ones calling the task. Create a new file main.py with the following content. We will create a list named crawling:to_visit and push the starting URL. Then we will go into a loop that will query that list for items and block for a minute until an item is ready. When an item is retrieved, we call the crawl function, enqueuing its execution.

It does almost the same as before but allows us to add items to the list, and they will be automatically processed. We could do that easily by looping over links and pushing them all, but it is not a good idea without deduplication and a maximum number of pages. We will keep track of all the queued and visited using sets and exit once their sum exceeds the maximum allowed.
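The deduplication and maximum check could be sketched as follows. To keep it runnable without a server, FakeRedis is an in-memory stand-in exposing only the redis-py methods used here (sadd, sismember, scard, rpush). The key names crawling:queued and crawling:visited, the add_to_visit helper, and the limit of 3 are assumptions for the demo; only crawling:to_visit comes from the text above.

```python
class FakeRedis:
    """In-memory stand-in with the few redis-py methods used below."""

    def __init__(self):
        self.sets, self.lists = {}, {}

    def sadd(self, name, value):
        members = self.sets.setdefault(name, set())
        is_new = value not in members
        members.add(value)
        return int(is_new)  # like Redis: number of members actually added

    def sismember(self, name, value):
        return value in self.sets.get(name, set())

    def scard(self, name):
        return len(self.sets.get(name, set()))

    def rpush(self, name, value):
        self.lists.setdefault(name, []).append(value)
        return len(self.lists[name])


MAX_PAGES = 3  # assumption: a small limit for the demo


def add_to_visit(connection, url):
    """Queue a URL only if it is new and the page limit is not reached."""
    total = connection.scard("crawling:queued") + connection.scard("crawling:visited")
    if total >= MAX_PAGES:
        return False
    if connection.sismember("crawling:visited", url):
        return False
    # sadd returns 0 when the member already existed, which deduplicates.
    if connection.sadd("crawling:queued", url) == 0:
        return False
    connection.rpush("crawling:to_visit", url)
    return True


connection = FakeRedis()  # with a real server: redis.Redis() from redis-py
for url in ["/a", "/b", "/a", "/c", "/d"]:
    print(url, add_to_visit(connection, url))
```

The duplicate "/a" and the over-the-limit "/d" are rejected, so only three URLs end up in crawling:to_visit.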

After executing, everything will be in Redis, so running again won't work as expected. We need to clean manually. We can do that by using redis-cli or a GUI like redis-commander. There are commands for deleting keys (i.e., DEL crawling:to_visit) or flushing the database (careful with this one).

Separate Responsibilities

We will start to separate concepts before the project grows. We already have two files: tasks.py and main.py. We will create another two to host crawler-related functions (crawler.py) and database access (repo.py). Please look at the snippet below for the repo file; it is not complete, but you get the idea. There is a GitHub repository with the final content in case you want to check it.

And the crawler file will have the functions for crawling, extracting links, and so on.

Allow Parser Customization

As mentioned above, we need some way to extract and store content and add only a particular subset of links to the queue. We need a new concept for that: default parser (parsers/defaults.py).

And in the repo.py file:

There is nothing new here, but it will allow us to abstract the link and content extraction. Instead of hardcoding it in the crawler, it will be a set of functions passed as parameters. Now we can substitute the calls to these functions by an import (for the moment).

For it to be completely abstracted, we need a generator or factory. We'll create a new file to host it - parserlist.py. To simplify a bit, we allow one custom parser per domain. The demo includes two domains for testing: scrapeme.live and quotes.toscrape.com.
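A minimal sketch of such a factory, assuming a plain dictionary keyed by domain. DefaultParser is a hypothetical placeholder for parsers/defaults.py, and get_parser is an illustrative name; the final repository may organize this differently.

```python
from urllib.parse import urlparse


class DefaultParser:
    """Placeholder for parsers/defaults.py (hypothetical stand-in)."""
    name = "default"


# One entry per domain; both use the default until custom parsers exist.
PARSERS = {
    "scrapeme.live": DefaultParser,
    "quotes.toscrape.com": DefaultParser,
}


def get_parser(url):
    """Return the parser registered for the URL's domain, else the default."""
    hostname = urlparse(url).hostname or ""
    if hostname.startswith("www."):
        hostname = hostname[4:]
    return PARSERS.get(hostname, DefaultParser)


print(get_parser("https://scrapeme.live/shop/").name)
# default
```

Adding a custom parser later only means changing the value for that domain in PARSERS.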

Nothing is customized for each domain yet, so we will use the default parser for both.

We can now modify the task with the new per-domain-parsers.

Custom Parser

We will use scrapeme first as an example. Check the repo for the final version and the other custom parser.

Knowledge of the page and its HTML is required for this part. Take a look at it if you want to get the feeling. To summarize, we will get the product id, name, and price for each item in the product list. Then store that in a set using the id as the key. As for the links allowed, only the ones for pagination will go through the filtering.

In the quotes site, we need to handle it differently since there is no ID per quote. We will extract the author and quote for each entry in the list. Then, in the store_content function, we'll create a list for each author and add that quote. Redis handles the creation of the lists when necessary.

With the last couple of changes, we have introduced custom parsers that will be easy to extend. When adding a new site, we must create one file per new domain and one line in parserlist.py referencing it. We could go a step further and "auto-discover" them, but no need to complicate it even more.

Get HTML: Headless Browsers

Until now, every page visited was done using requests.get, which can be inadequate in some cases. Say we want to use a different library or headless browser, but just for some cases or domains. Loading a browser is memory-consuming and slow, so we should avoid it when it is not mandatory. The solution? Even more customization. New concept: collector.

We will create a file named collectors/basic.py and paste the already known get_html function. Then change the defaults to use it by importing it. Next, create a new file, collectors/headless_firefox.py, for the new and shiny method of getting the target HTML. As in the previous post, we will be using playwright. And we will also parametrize headers and proxies in case we want to use them. Spoiler: we will.

If we want to use a headless Firefox for some domain, merely modify the get_html for that parser (i.e., parsers/scrapemelive.py).

As you can see in the final repo, we also have a fake.py collector used in scrapemelive.py. Since we used that website for intense testing, we downloaded all the product pages the first time and stored them in a data folder. We can customize with a headless browser, but we can do the same with a file reader, hence the "fake" name.

Avoid Detection with Headers and Proxies

You guessed it: we want to add custom headers and use proxies. We will start with the headers by creating a file named headers.py. We won't paste the entire content here: there are three different sets of headers for a Linux machine, and it gets pretty long. Check the repo for the details.

We can import a concrete set of headers or call the random_headers to get one of the available options. We will see a usage example in a moment.

The same applies to the proxies: create a new file, proxies.py. It will contain a list of them grouped by the provider. In our example, we will include only free proxies. Add your paid ones in the proxies dictionary and change the default type to the one you prefer. If we were to complicate things, we could add a retry with a different provider in case of failure.

Note that these free proxies might not work for you. They are short-lived.

And the usage in a parser:

Bringing it All Together

It's been a long and eventful trip. It is time to put an end to it by completing the puzzle. We hope you understood the whole process and all the challenges that scraping and crawling at scale have.

We cannot show the final code here, so take a look at the repository and do not hesitate to comment or contact us with any questions.

The two entry points are tasks.py for Celery and main.py to start queueing URLs. From there, we begin storing URLs in Redis to keep track and start crawling the first URL. A custom or the default parser will get the HTML, extract and filter links, and generate and store the appropriate content. We add those links to a list and start the process again. Thanks to Celery, once there is more than one link in the queue, the parallel/distributed process starts.

Points Still Missing

We already covered a lot of ground, but there is always one more step. Here are a few functionalities that we did not include. Also, note that most of the code does not contain error handling or retries for brevity's sake.

Distributed

We didn't include it, but Celery offers it out-of-the-box. For local testing, we can start two different workers celery -A tasks worker --concurrency=20 -n worker1 and ... -n worker2. The way to go is to do the same on other machines as long as they can connect to the broker (Redis in our case). We could even add or remove workers and servers on the fly, with no need to restart the rest. Celery handles the workers and distributes the load.

It is important to note that the worker's name is essential, especially when starting several on the same machine. If we execute the above command twice without changing the worker's name, Celery won't recognize them correctly. Thus launch the second one as -n worker2.

Rate Limit

Celery does not allow a rate limit per task and parameter (in our case, domain). That means we can throttle workers or queues, but not at the fine-grained level we would like. There are several open issues and workarounds. From reading several of those, the takeaway is that we cannot do it without keeping track of the requests ourselves.

We could easily rate-limit to 30 requests per minute for each task with the provided param @app.task(rate_limit="30/m"). But remember that it would affect the task, not the crawled domain.
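Keeping track of the requests ourselves could look like the minimal sketch below. It is not part of the repository: DomainThrottle and seconds_to_wait are hypothetical names, and the state lives in a plain in-memory dictionary, which only works inside a single process.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Hypothetical helper: one request per domain every `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the logic can be checked without sleeping
        self.last_request = {}  # domain -> time of the last allowed request

    def seconds_to_wait(self, url):
        """Return 0 if the URL may be requested now, else the remaining delay."""
        domain = urlparse(url).netloc
        now = self.clock()
        last = self.last_request.get(domain)
        if last is not None and now - last < self.min_interval:
            return self.min_interval - (now - last)
        self.last_request[domain] = now
        return 0


throttle = DomainThrottle(min_interval=2)  # roughly 30 requests per minute per domain
print(throttle.seconds_to_wait('https://scrapeme.live/shop/page/1/'))  # 0
print(throttle.seconds_to_wait('https://scrapeme.live/shop/page/2/') > 0)  # True
print(throttle.seconds_to_wait('https://example.com/'))  # 0, different domain
```

With several workers or machines, the timestamps would have to live in a shared store (Redis, in our setup) instead of a local dictionary, and a task that gets a non-zero delay would need to be queued again later.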

Robots.txt

Along with the allow_url_filter part, we should also add a robots.txt checker. For that, the robotparser module (urllib.robotparser in Python 3) can take a URL and tell us whether we are allowed to crawl it. We can add it to the default or as a standalone function, and then each scraper decides whether to use it. We thought the project was complex enough and did not implement this functionality.

If you were to do it, check when the file was last fetched with mtime() and reread it from time to time. Also, cache it to avoid requesting it for every single URL.
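As a rough idea of how that could look, here is a sketch using urllib.robotparser from the standard library. The robots.txt content, the is_allowed helper, and the one-day max_age are assumptions made for illustration; the rules are parsed from a string so the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it with
# parser.set_url(...) followed by parser.read().
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
parser.modified()  # record "now" as the time the rules were last fetched


def is_allowed(url, user_agent='*', max_age=24 * 60 * 60):
    """Check a URL against the cached rules, flagging them when they get old."""
    if time.time() - parser.mtime() > max_age:
        # Too old: a real crawler would call parser.read() again here.
        print('robots.txt is stale, reread it')
    return parser.can_fetch(user_agent, url)


print(is_allowed('https://example.com/shop/page/1/'))  # True
print(is_allowed('https://example.com/private/data'))  # False
```

Keeping one parser per domain in a dictionary would give us the cache, so the file is requested once per domain and not once per URL.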

Conclusion

Building a custom crawler/parser at scale is neither an easy nor a straightforward task. We provided some guidance and tips, hopefully helping you all with your day-to-day tasks.

Before developing something as big as this project at scale, think about some important take-aways:

Separate responsibilities.
Use abstractions when necessary, but do not over-engineer.
Don't be afraid of using specialized software instead of building everything.
Think about scaling even if you don't need it now; just keep it in mind.

Thanks for joining us until the end. It's been a fun series to write, and we hope it has been just as enjoyable to read. If you liked it, you might be interested in the JavaScript Web Scraping guide.

Do not forget to take a look at the rest of the posts in this series.

Did you find the content helpful? Please, spread the word and share it. 👈

Originally published at https://www.zenrows.com

Mastering Web Scraping in Python: Crawling from Scratch

Ander Rodriguez — Wed, 11 Aug 2021 11:55:02 +0000

Have you ever tried to crawl thousands of pages? Scale that even further? Handle and recover from system failures?

After seeing how to extract content from a website and how to avoid being blocked, we'll take a look at the crawling process. To get data at scale, getting a few URLs by hand is not an option. We need to use an automated system that will discover new pages and visit them.

Disclaimer: for real-world usage, find suitable software. Below is more info on that. This guide aims to be an introduction to how the crawling process works and how to do the basics. But there are tons of details that need addressing.

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install.

pip install requests beautifulsoup4 pandas

How to Get all the Links on the Page

From the first article in the series, we know that getting data from a webpage is easy with requests.get and BeautifulSoup. We will start by finding the links in a fake shop prepared for testing scraping.

The basics to get the content are the same. Then we get all the links on the paginator and add the links to a set. We chose set to avoid duplicates. As you can see, we hardcoded the selector for the links, meaning that it is not a universal solution. For the moment, we'll focus on the page at hand.

import requests 
from bs4 import BeautifulSoup 

to_visit = set() 
response = requests.get('https://scrapeme.live/shop/page/1/') 
soup = BeautifulSoup(response.content, 'html.parser') 
for a in soup.select('a.page-numbers'): 
    to_visit.add(a.get('href')) 

print(to_visit) 
# {'https://scrapeme.live/shop/page/2/', '.../3/', '.../46/', '.../48/', '.../4/', '.../47/'}

One URL at a Time, Sequential

Now we have several links but no way to visit them all. To fix that, we need some kind of loop that will execute the extracting part for every available URL. Maybe the most straightforward way, although not a scalable one, is to reuse the same loop. But before that, there is a missing piece: avoid crawling the same page twice.

We'll keep track of already visited links in another set and avoid duplicates by checking them before every request. In this case, to_visit is not being used, just maintained for demo purposes. To prevent visiting every page, we'll also add a max_visits variable. For now, we ignore the robots.txt file, but we have to be civil and nice.

visited = set() 
to_visit = set() 
max_visits = 3 

def crawl(url): 
    print('Crawl: ', url) 
    response = requests.get(url) 
    soup = BeautifulSoup(response.content, 'html.parser') 
    visited.add(url) 
    for a in soup.select('a.page-numbers'): 
        link = a.get('href') 
        to_visit.add(link) 
        if link not in visited and len(visited) < max_visits: 
            crawl(link) 

crawl('https://scrapeme.live/shop/page/1/') 

print(visited) # {'.../3/', '.../1/', '.../2/'} 
print(to_visit) # { ... new ones added, such as pages 5 and 6 ... }

It is a recursive function with two exit conditions: there are no more links to visit, or we reached the maximum visits. In either case, it will exit and print the visited links and the ones pending.

It is important to note that the same link can be added many times, but it will only get crawled once. In a big project, the idea would be to set a timer and only request each URL after a few days.
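The timer idea could be sketched as follows. This is only an illustration: RECRAWL_AFTER, should_crawl, and mark_crawled are hypothetical names, the three-day interval is arbitrary, and a real project would keep the timestamps in a database instead of a dictionary.

```python
import time

RECRAWL_AFTER = 3 * 24 * 60 * 60  # three days in seconds, an arbitrary choice
last_crawled = {}  # url -> timestamp of the last visit


def should_crawl(url, now=None):
    """Return True if the URL was never crawled or its last visit is too old."""
    now = time.time() if now is None else now
    last = last_crawled.get(url)
    return last is None or now - last >= RECRAWL_AFTER


def mark_crawled(url, now=None):
    """Remember when the URL was visited."""
    last_crawled[url] = time.time() if now is None else now


url = 'https://scrapeme.live/shop/page/1/'
print(should_crawl(url))  # True, never visited
mark_crawled(url)
print(should_crawl(url))  # False, visited a moment ago
```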

Separation of Concerns

We said this is not about extracting or parsing content, but we need to separate concerns before it becomes entangled. For that, we'll create three helper functions: get HTML, extract links, and extract content. As their names imply, each of them will perform one of the main tasks of web scraping.

The first one will get the HTML from a URL using the same library as earlier but wrapping it in a try block for safety, so a failed request does not stop the crawler.

def get_html(url): 
    try: 
        return requests.get(url).content 
    except Exception as e: 
        print(e) 
        return ''

The second one, extracting the links, will work just as before.

def extract_links(soup): 
    return [a.get('href') for a in soup.select('a.page-numbers') 
        if a.get('href') not in visited]

The last one will be the placeholder for extracting the content we want. Since we are simplifying this part, it will get basic info from the same page, with no need to visit the detail page.

To show that we can extract some content, we will print each product's title (Pokémon name).

def extract_content(soup): 
    for product in soup.select('.product'): 
        print(product.find('h2').text) 
        # Bulbasaur, Ivysaur, ...

Assembling it all together.

def crawl(url): 
    if not url or url in visited: 
        return 
    print('Crawl: ', url) 
    visited.add(url) 
    html = get_html(url) 
    soup = BeautifulSoup(html, 'html.parser') 
    extract_content(soup) 
    links = extract_links(soup) 
    to_visit.update(links)

Noticed something different? The crawling logic is not attached to the link extracting part. Each of the helpers handles a single piece. And the crawl function acts as an orchestrator by calling them and applying the results.

As the project evolves, all these parts could be moved to files or passed as parameters/callbacks. We can generalize the use cases if the core is independent of the selected page and content.

Are we missing something? 🤔
We need to add the first URL and call the crawling function. Since crawl is not recursive anymore, we'll handle that in a separate loop.

to_visit.add('https://scrapeme.live/shop/page/1/') 

while (len(to_visit) > 0 and len(visited) < max_visits): 
    crawl(to_visit.pop())

Parallel Requests

There is a significant part missing: parallelism. HTTP request handlers are idle most of the time, waiting for the response to come back. It means that we can send several of them at the same time without overloading the machine, and then process them as they come back.

It is relevant to note that this approach only works if the order is not imperative. But we are already using sets and, according to Python's definition, "a set is an unordered collection with no duplicate elements." That means our process was unordered from the start.

Before diving deep into the parallel requests, we have to understand a couple of concepts: synchronization and queues.

Synchronized Queues

There is a huge risk in threaded or parallel computing: modifying the same variables or data structures from different threads. It means two of our requests would be adding new links to a set (i.e., to_visit). Since the data structure is not protected, both could read and write it like this:

Both read its content, i.e. (1, 2, 3) (simplified)
Thread one adds links to pages 4, 5: (1, 2, 3, 4, 5)
Thread two adds links to pages 6, 7: (1, 2, 3, 6, 7)

How did this happen? When thread two wrote the new links, it added them to the set it had read earlier, with only three elements, so the links written by thread one were lost.
This is a very simplified version; check the links for more info.
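The interleaving above can be replayed step by step, without real threads, to see where the links get lost. The second half of the sketch repeats the same read-then-write pattern with real threads guarded by a threading.Lock, the kind of protection discussed next. The names shared, state, and add_links are only illustrative.

```python
import threading

# Sequential replay of the interleaving described above.
shared = {1, 2, 3}
read_by_one = set(shared)  # thread one reads
read_by_two = set(shared)  # thread two reads the same content
shared = read_by_one | {4, 5}  # thread one writes (1, 2, 3, 4, 5)
shared = read_by_two | {6, 7}  # thread two overwrites it
print(sorted(shared))  # [1, 2, 3, 6, 7] -> pages 4 and 5 are lost

# Same read-then-write pattern, but guarded by a lock.
state = {'links': {1, 2, 3}}
lock = threading.Lock()


def add_links(new_links):
    with lock:  # only one thread at a time can read and write
        current = set(state['links'])
        state['links'] = current | set(new_links)


threads = [threading.Thread(target=add_links, args=(links,))
           for links in ([4, 5], [6, 7])]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(sorted(state['links']))  # [1, 2, 3, 4, 5, 6, 7]
```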

What can we do to avoid these conflicts? Synchronization or locking. From the docs: "queues use locks to temporarily block competing threads." It means that thread one would acquire a lock on the set, read and write without any problem, and then release the lock automatically. Meanwhile, thread two would have to wait until the lock becomes available. Only then read and write.

import queue 

q = queue.Queue() 
q.put('https://scrapeme.live/shop/page/1/') 

def crawl(url): 
    ... 
    links = extract_links(soup) 
    for link in links: 
        if link not in visited: 
            q.put(link)

For the moment, it does not work. Do not worry. The changes in the existing code are minimal: we replaced to_visit with a queue. But queues need handlers or workers to process their content. With the above, we have created a Queue and added an item (the original one). We also modified the crawl function to put links in the queue instead of updating the previous set.

We'll create a worker using the threading module to process that queue.

from threading import Thread 

def queue_worker(i, q): 
    while True: 
        url = q.get() # Get an item from the queue, blocks until one is available 
        print('to process:', url) 
        q.task_done() # Notifies the queue that the item has been processed 

q = queue.Queue() 
Thread(target=queue_worker, args=(0, q), daemon=True).start() 

q.put('https://scrapeme.live/shop/page/1/') 
q.join() # Blocks until all items in the queue are processed and marked as done 
print('Done') 

# to process: https://scrapeme.live/shop/page/1/ 
# Done

We defined a new function that will handle the queued items. For that, we enter an infinite loop that keeps waiting for new items; it only ends when the program itself finishes.

Then get an item, which will block until an item is available. We process that item; for the moment, just print it to show how it works. It will call crawl later.

Finally, we notify the queue that the item has been processed by calling task_done.

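Once every item in the queue has been marked as done, join returns and the main thread continues. According to the docs, join "blocks until all items in the queue have been gotten and processed." The worker's infinite loop does not exit on its own; since it runs in a daemon thread, it is terminated when the main program ends.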

Now we need two more things: process items and create more threads (it would not be parallel with just one, would it?).

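def queue_worker(i, q): 
    while True: 
        url = q.get() 
        try: 
            if len(visited) < max_visits and url not in visited: 
                crawl(url) 
        except Exception as e: 
            print(e) # Keep the worker alive if one page fails 
        finally: 
            q.task_done() # Always mark the item as done so q.join() can return 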

q = queue.Queue() 
num_workers = 4 
for i in range(num_workers): 
    Thread(target=queue_worker, args=(i, q), daemon=True).start()

Be careful when running it since big numbers in num_workers and max_visits would start lots of requests. If the script had some minor bug for any reason, you could perform hundreds of requests in a few seconds.

Performance

We ran benchmarks with different settings, only as a reference.

Sequential requests: 29.32s
Queue with one worker (num_workers = 1): 29.41s
Queue with two workers (num_workers = 2): 20.05s
Queue with five workers (num_workers = 5): 11.97s
Queue with ten workers (num_workers = 10): 12.02s

There is almost no difference between sequential requests and having one worker. Threads carry some overhead, but it is barely noticeable here. It would require a more severe load test. Once we start adding workers, that overhead pays off. We could add even more, but it won't affect the outcome since they will be idle most of the time.
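To get a feeling for how the number of workers affects the total time without hitting a real website, we could run a simulation like the one below. It is not the benchmark used for the numbers above: it swaps the queue workers for concurrent.futures.ThreadPoolExecutor, and fake_request with its 50 ms sleep is an assumption standing in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(url):
    time.sleep(0.05)  # stands in for the time spent waiting for a response
    return url


urls = [f'https://example.com/page/{i}/' for i in range(10)]


def timed_run(num_workers):
    """Process all the URLs with the given number of threads and time it."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(fake_request, urls))
    return time.perf_counter() - start, results


for workers in (1, 2, 5, 10):
    elapsed, _ = timed_run(workers)
    print(f'{workers} workers: {elapsed:.2f}s')
```

In this simulation, the time keeps dropping as workers are added because the sleep never saturates anything. Real requests will not scale that cleanly, as the plateau between five and ten workers in the numbers above shows.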

Distributed Processing

We won't cover the next scale-up step: distributing the crawling process among several servers. Python allows it, and some libraries can help you with it (Celery or Redis Queue). It is a huge step, and we have already covered enough for the day.

As a quick preview, the idea behind it is the same as the one with the threads. Each item will be processed as we've seen until now but in different threads or even machines running the same code. With this approach, we can scale even further; theoretically, with no limit. But in reality, there is always a limit or bottleneck, usually the central node that handles the distribution.

Take into Account when Scaling Up

We've shown a simplified version of a crawling process for educational purposes. To apply all this at scale, you should consider several things first.

Build vs Buy vs Open Source

Before you write your own library for crawling, try some of the options out there. Many great Open Source libraries can achieve it: Scrapy, pyspider, node-crawler (Node.js), or Colly (Go). Many companies and services also provide scraping and crawling solutions.

Avoid being blocked

As we saw in a previous post, there are several actions we can take to avoid blocking. A couple of them are proxies and headers. Here is a simple snippet adding those to our current code.
Note that these free proxies might not work for you. They are short-lived.

proxies = { 
    'http': 'http://190.64.18.177:80', 
    'https': 'http://49.12.2.178:3128', 
} 

headers = { 
    'authority': 'httpbin.org', 
    'cache-control': 'max-age=0', 
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"', 
    'sec-ch-ua-mobile': '?0', 
    'upgrade-insecure-requests': '1', 
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36', 
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 
    'sec-fetch-site': 'none', 
    'sec-fetch-mode': 'navigate', 
    'sec-fetch-user': '?1', 
    'sec-fetch-dest': 'document', 
    'accept-language': 'en-US,en;q=0.9', 
} 

def get_html(url): 
    try: 
        response = requests.get(url, headers=headers, proxies=proxies) 
        return response.content 
    except Exception as e: 
        print(e) 
        return ''

Extracting Content

We won't go into details here, only a simple snippet for extracting id, name, and price per item. We store everything in a data array, which is not a great idea. But it is enough for demo purposes.

data = [] 

def extract_content(soup): 
    for product in soup.select('.product'): 
        data.append({ 
            'id': product.find('a', attrs={'data-product_id': True})['data-product_id'], 
            'name': product.find('h2').text, 
            'price': product.find(class_='amount').text 
        }) 

print(data) 
# [{'id': '759', 'name': 'Bulbasaur', 'price': '£63.00'}, {'id': '729', 'name': 'Ivysaur', 'price': '£87.00'}, ...]

Persistency

We haven't persisted anything, and that does not scale. In a real-world case, we should store the content and even the HTML itself for later processing, along with all the discovered URLs and a timestamp for each. It all starts to sound like a database is needed. Depending on our needs, we could store just the actual content or everything in a generic way: URLs, dates, HTML, etcetera.
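As one possible starting point, the sketch below stores each page with its URL and a timestamp using sqlite3 from the standard library. The pages table, its columns, and the store_page helper are assumptions for illustration, not something the final code includes.

```python
import sqlite3
import time

# ':memory:' keeps the sketch self-contained; a real crawler would use a file
# or a database server instead.
connection = sqlite3.connect(':memory:')
connection.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url TEXT PRIMARY KEY,
        crawled_at REAL,
        html TEXT
    )
""")


def store_page(url, html):
    """Insert the page or overwrite the previous copy of the same URL."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            'INSERT OR REPLACE INTO pages (url, crawled_at, html) VALUES (?, ?, ?)',
            (url, time.time(), html),
        )


store_page('https://scrapeme.live/shop/page/1/', '<html>...</html>')
store_page('https://scrapeme.live/shop/page/1/', '<html>updated</html>')
print(connection.execute('SELECT COUNT(*) FROM pages').fetchone()[0])  # 1
```

Keep in mind that, by default, a sqlite3 connection can only be used from the thread that created it, so the threaded version of the crawler would need one connection per worker or a different storage.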

Canonicals

The link extraction part does not take into consideration canonical links. A page can have more than one URL: query strings or hashes might modify it. In our case, we would crawl it twice. It's not a problem now, but something to consider.

The right approach would be to add the canonical URL (if present) to the visited list. Then we could arrive at that same page from a different origin URL, but we would detect it as duplicate. We could also remove some query string parameters using url_query_cleaner.
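For the query string part, here is a sketch that uses only urllib.parse from the standard library as a stand-in for url_query_cleaner. The clean_url helper and its default list of parameters to drop are hypothetical; adjust them to the target site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_url(url, drop_params=('utm_source', 'utm_medium', 'utm_campaign')):
    """Drop the fragment and the listed query parameters from a URL."""
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key not in drop_params]
    # The last item is the fragment, left empty on purpose
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ''))


print(clean_url('https://scrapeme.live/shop/page/2/?utm_source=news&orderby=price#top'))
# https://scrapeme.live/shop/page/2/?orderby=price
```

Adding clean_url(url) to visited, instead of the raw URL, would make both variants of the same page count as one.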

Robots.txt

We have not checked it because we are using a test website prepared for scraping. But please check the robots file and comply with it when crawling an actual target. And above all, do not cause more traffic than they can handle. Once again, be civil and nice ;)

Final Code

import requests 
from bs4 import BeautifulSoup 
import queue 
from threading import Thread 

starting_url = 'https://scrapeme.live/shop/page/1/' 
visited = set() 
max_visits = 100 # careful, it will crawl all the pages 
num_workers = 5 
data = [] 

def get_html(url): 
    try: 
        response = requests.get(url) 
        # response = requests.get(url, headers=headers, proxies=proxies) 
        return response.content 
    except Exception as e: 
        print(e) 
        return '' 

def extract_links(soup): 
    return [a.get('href') for a in soup.select('a.page-numbers') 
            if a.get('href') not in visited] 

def extract_content(soup): 
    for product in soup.select('.product'): 
        data.append({ 
            'id': product.find('a', attrs={'data-product_id': True})['data-product_id'], 
            'name': product.find('h2').text, 
            'price': product.find(class_='amount').text 
        }) 

def crawl(url): 
    visited.add(url) 
    print('Crawl: ', url) 
    html = get_html(url) 
    soup = BeautifulSoup(html, 'html.parser') 
    extract_content(soup) 
    links = extract_links(soup) 
    for link in links: 
        if link not in visited: 
            q.put(link) 

def queue_worker(i, q): 
    while True: 
        url = q.get() # Get an item from the queue, blocks until one is available 
        try: 
            if len(visited) < max_visits and url not in visited: 
                crawl(url) 
        except Exception as e: 
            print(e) # Keep the worker alive if one page fails 
        finally: 
            q.task_done() # Always notify the queue so q.join() can return 

q = queue.Queue() 
for i in range(num_workers): 
    Thread(target=queue_worker, args=(i, q), daemon=True).start() 

q.put(starting_url) 
q.join() # Blocks until all items in the queue are processed and marked as done 

print('Done') 
print('Visited:', visited) 
print('Data:', data)

Conclusion

We'd like you to take away three main points:

Separate getting the HTML and extracting the links from the crawling itself.
Choose the appropriate system for your use case: simple sequential, parallel, or distributed.
Building from scratch to a vast scale will probably hurt. Take a look at free or paid libraries/solutions.

We are close to finishing this series on Web Scraping. Stay tuned for the next one on scaling this crawling process even further.

Do not forget to take a look at the rest of the posts in this series.

Originally published at https://www.zenrows.com

Stealth Web Scraping in Python: Avoid Blocking Like a Ninja

Ander Rodriguez — Thu, 29 Jul 2021 14:06:34 +0000

Scraping should be about extracting content from HTML. Sounds simple. Sometimes it is not. It has many obstacles. The first one is to obtain the said HTML.

You can open a browser, go to a URL, and it's there. Dead simple. We're done. 👩‍💻

If you don't need a bigger scale, that's it; you're done. But bear with us if that's not the case and you want to learn how to scrape thousands of URLs without getting blocked.

Websites tend to protect their data and access. There are many possible actions a defensive system could take. We'll start a journey through some of them and learn how to avoid or mitigate their impact.

Note: when testing at scale, never use your home IP directly. A small mistake or slip and you will get banned.

Prerequisites

For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install.

pip install requests beautifulsoup4 pandas

IP Rate Limit

The most basic security system is to ban or throttle requests from the same IP. The reasoning is that a regular user would not request a hundred pages in a few seconds, so the system tags that connection as dangerous.

import requests

response = requests.get('http://httpbin.org/ip')
print(response.json()['origin']) # xyz.84.7.83

IP rate limits work similarly to API rate limits, but there is usually no public information about them. We cannot know for sure how many requests we can make safely.

Our Internet Service Provider assigns us our IP, and we can neither choose nor mask it ourselves. The solution is to send the requests from a different one. We cannot modify a machine's IP, but we can use different machines. Datacenters might have different IPs, although that is not a real solution.

Proxies are. A proxy takes an incoming request and relays it to the final destination without processing it. But that is enough to mask our IP since the target website will see the proxy's IP.

Rotating Proxies

There are Free Proxies even though we do not recommend them. They might work for testing but are not reliable. We can use some of those for testing, as we'll see in some examples.

Now we have a different IP, and our home connection is safe and sound. Good. But what if they block the proxy's IP? We are back to the initial position.

We won't go into detail about free proxies. Just use the next one on the list. Change them frequently since their lifespan is usually short.
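"Use the next one on the list" could be sketched with itertools.cycle, as shown below. The two addresses are only placeholders that most likely stopped working long ago, and next_proxies is a hypothetical helper that returns the dictionary format requests expects in its proxies parameter.

```python
from itertools import cycle

# Placeholder addresses; replace them with proxies that work for you.
proxy_pool = cycle([
    '190.64.18.177:80',
    '49.12.2.178:3128',
])


def next_proxies():
    """Return a proxies dict for requests, built from the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {'http': f'http://{proxy}', 'https': f'http://{proxy}'}


for _ in range(3):
    print(next_proxies()['http'])
# http://190.64.18.177:80
# http://49.12.2.178:3128
# http://190.64.18.177:80
```

Each request would then use requests.get(url, proxies=next_proxies()), going back to the first proxy once the list is exhausted.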

Paid proxy services, on the other hand, offer IP Rotation. Meaning that our service will work the same, but the website will see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they are much harder to ban. And when it happens, we'll get a new IP after a short time.

import requests

proxies = {'http': 'http://190.64.18.177:80'}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print(response.json()['origin']) # 190.64.18.162

If we know about these, they know about them too. Some big companies won't allow traffic from known proxy IPs or datacenters. For those cases, there is a higher proxy level: Residential.

More expensive and sometimes bandwidth-limited, residential proxies offer us IPs in use by "normal" people. That implies our mobile provider could assign us that IP tomorrow, or a friend might have had it yesterday. They are indistinguishable from actual final users.

We can scrape whatever we want, right? The cheaper ones by default, the expensive ones when necessary. No, not there yet. We only passed the first hurdle, some more to come. We have to look like a legitimate user to avoid being tagged as a bot or scraper.

User-Agent Header

The next step would be to check our request headers. The best-known one is User-Agent (UA for short), but there are many more. UA follows a format that we'll see later, and many software tools have their own, for example, GoogleBot. Here is what the target website will receive if we directly use "python requests" or curl.

import requests

response = requests.get('http://httpbin.org/headers')
print(response.json()['headers']['User-Agent'])
# python-requests/2.25.1

curl http://httpbin.org/headers
# { ... "User-Agent": "curl/7.74.0" ... }

Many sites won't check UA, but this is a huge red flag for the ones that do. We'll have to fake it. Luckily, most libraries allow custom headers. Following the example using requests:

import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
response = requests.get('http://httpbin.org/headers', headers=headers)
print(response.json()['headers']['User-Agent'])
# Mozilla/5.0 ...

To get your current UA, visit httpbin - just as the code snippet is doing - and copy it. Requesting all the URLs with the same UA might also trigger some alerts, so the solution is again a bit more complicated.

Ideally, we would have all the current possible User-Agents and rotate them as we did with the IPs. Since that is close to impossible, we can at least have a few. There are lists of User Agents available for us to choose from.

import requests
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Linux; Android 11; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Mobile Safari/537.36'
]
user_agent = random.choice(user_agents)
headers = {'User-Agent': user_agent}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers']['User-Agent'])
# Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) ...

Keep in mind that browsers change versions quite often, and this list can be obsolete in a few months. If we are to use User-Agent rotation, a reliable source is essential. We can do it by hand or use a service provider.

We are a step closer, but there is still one flaw in the headers: they also know this trick and check other headers along with the UA.

Full Set of Headers

Each browser, or even version, sends different headers. Check Chrome and Firefox in action: (ignore X-Amzn-Trace-Id)

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-US,en;q=0.9", 
    "Host": "httpbin.org", 
    "Sec-Ch-Ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"", 
    "Sec-Ch-Ua-Mobile": "?0", 
    "Sec-Fetch-Dest": "document", 
    "Sec-Fetch-Mode": "navigate", 
    "Sec-Fetch-Site": "none", 
    "Sec-Fetch-User": "?1", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-60ff12bb-55defac340ac48081d670f9d"
  }
}

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-US,en;q=0.5", 
    "Host": "httpbin.org", 
    "Sec-Fetch-Dest": "document", 
    "Sec-Fetch-Mode": "navigate", 
    "Sec-Fetch-Site": "none", 
    "Sec-Fetch-User": "?1", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0", 
    "X-Amzn-Trace-Id": "Root=1-60ff12e8-229efca73430280304023fb9"
  }
}

It means what you think it means. The previous array with 5 User Agents is incomplete. We need an array with a complete set of headers per User-Agent. For brevity, we will show a list with one item. It is already long enough.

In this case, copying the result from httpbin is not enough. The ideal would be to copy it directly from the source. The easiest way to do it is from the Firefox or Chrome DevTools - or equivalent in your browser. Go to the Network tab, visit the target website, right-click on the request and copy as cURL. Then convert curl syntax to Python and paste the headers in the list.

import requests
import random

headers_list = [ {
    'authority': 'httpbin.org',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'sec-ch-ua-mobile': '?0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
}
# , {...}
]
headers = random.choice(headers_list)
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers'])

We could add a Referer header for extra security - such as Google or an internal page from the same website. It would mask the fact that we are always requesting URLs directly without any interaction. But be careful since they could also detect fake referrers.
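
As a rough illustration of that idea, the helper below adds a Referer to only some of the requests, so the pattern is less uniform. It is a minimal sketch: the function name with_random_referer, the REFERERS pool, and the 50% default are our own assumptions, not part of any library.

```python
import random

# Hypothetical referrer pool; replace with pages that make sense for the target site.
REFERERS = [
    "https://www.google.com/",
    "https://www.bing.com/",
]


def with_random_referer(headers, referers=REFERERS, probability=0.5, rng=random):
    """Return a copy of headers, adding a Referer in roughly `probability` of the calls."""
    new_headers = dict(headers)  # never mutate the shared header set
    if referers and rng.random() < probability:
        new_headers["referer"] = rng.choice(referers)
    return new_headers
```

It could be combined with the previous snippet as requests.get(url, headers=with_random_referer(random.choice(headers_list))).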

Cookies

We ignored the cookies in the headers section because the best option is not to use them. Unless we want to log in, we can ignore them most of the time. Some websites will block the content or redirect to login after a few visits, and they use cookies for that. Thus, we can avoid a possible block or login wall by not sending them.

Other websites will be more tolerant if we perform several actions from the same IP with those session cookies. But it is difficult to say when it will work unless we test it. There is no silver bullet.

Anyhow, it might look suspicious if all our requests go without cookies, but allowing them will be worse if we are not extremely careful. And there are no upsides to using session cookies for the initial request. But what happens if we want content generated in the browser after XHR calls?

First, we will need to use a headless browser. Second, not sending cookies will look more than suspicious in these cases. Right after the initial load, the Javascript will try to get some content using an XHR call. There is no way we can do that call without cookies. A legitimate user cannot do that.

Headless Browsers

While avoiding them - for performance reasons - would be preferable, sometimes there is no other choice. Selenium, Puppeteer, and Playwright are the best-known and most widely used libraries. The snippet below shows only the User-Agent, but since it is a real browser, the headers will include the entire set (Accept, Accept-Encoding, etcetera).

import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # p.webkit is also supported, but there is a problem on Linux
    for browser_type in [p.chromium, p.firefox]:
        browser = browser_type.launch()
        page = browser.new_page()
        page.goto('https://httpbin.org/headers')
        jsonContent = json.loads(page.inner_text('pre'))
        print(jsonContent['headers']['User-Agent'])
        browser.close()

# Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/93.0.4576.0 Safari/537.36
# Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0

Back to the headers section: we can add custom headers that will overwrite the default ones. Replace the browser.new_page() line in the previous snippet with this one and paste a valid User-Agent:

page = browser.new_page(extra_http_headers={'User-Agent': '...'})

That is just the entry level with headless browsers. Headless detection is a field in itself, and many people are working on it: some to detect it, some to avoid being detected. As an example, you can visit pixelscan with an actual browser and a headless one. To be deemed "consistent," you'll need to work hard.

Take a look at the screenshot below, taken when visiting pixelscan with playwright. See the UA? The one we fake is all right, but they can detect that we are lying by checking the navigator Javascript API. We could simulate that too (a bit more complicated), but then another check would identify the headless browser. In summary, having 100% coverage is complex, but you won't need it most of the time.

Analyzing the Firefox output from the previous example, it looks like a legit one, almost the same as what we see using a real browser on a Linux machine. Our advice is not to lie if possible. Faking an iPhone UA is easy; the problems come afterward. You can also fake navigator.userAgent and some others. But probably not WebGL, touch events, or battery status.

If you are running a crawler on the cloud, probably a Linux machine, it might look a bit unusual. But it is entirely legit to run Firefox on Linux - we are doing it right now.

Geographic Limits or Geoblocking

Have you ever tried to watch CNN from outside the US?

That's called geoblocking. Only connections from inside the US can watch CNN live. To bypass that, we could use a Virtual Private Network (VPN). We can then browse as usual, but the website will see a local IP thanks to the VPN.

The same can happen when scraping websites with geoblocking. There is an equivalent for proxies: geolocated proxies. Some proxy providers allow us to choose from a list of countries. With that activated, we will only get local IPs from the US, for example.
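
How a country is selected depends entirely on the provider, so the following is only a sketch under assumptions: PROXIES_BY_COUNTRY and proxies_for are hypothetical names, and the addresses come from ranges reserved for documentation. The only real API involved is the proxies mapping that requests accepts.

```python
import random

# Hypothetical pool: country code -> proxy URLs. Real providers expose this
# differently (often through a username suffix or a dedicated port).
PROXIES_BY_COUNTRY = {
    "us": ["http://203.0.113.10:8080", "http://203.0.113.11:8080"],
    "de": ["http://198.51.100.20:3128"],
}


def proxies_for(country, pool=PROXIES_BY_COUNTRY, rng=random):
    """Pick one proxy from the given country, in the mapping format requests expects."""
    candidates = pool.get(country.lower())
    if not candidates:
        raise ValueError(f"no proxies available for country {country!r}")
    proxy = rng.choice(candidates)
    return {"http": proxy, "https": proxy}
```

A request through a US exit would then look like requests.get(url, proxies=proxies_for("us")).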

Behavioral patterns

Blocking IPs and User-Agents is not enough these days. Blocklists become unmanageable and stale in hours, if not minutes. As long as we perform requests with "clean" IPs and real-world User-Agents, we are mostly safe. There are more factors involved, but most of the requests should be valid.

However, modern anti-bot software uses machine learning and behavioral patterns, not just static markers (IP, UA, geolocation). That means that we would be detected if we always performed the same actions in the same order.

Go to the homepage
Click on the "Shop" button
Scroll down
Go to page 2
...

After a few days, launching the same script could result in every request being blocked. Many people can perform those same actions, but bots have something that makes them obvious: speed. With software, we would execute every step sequentially, while an actual user would take a second, then click, scroll down slowly using the mouse wheel, move the mouse to the link and click.

Maybe there is no need to fake all that, but be aware of the possible problems and know how to face them.

We have to think about what we actually want. Maybe we don't need that first request since we only require the second page. We could use that as an entry point instead of the homepage and save one request. It can scale to hundreds of URLs per domain. No need to visit every page in order, scroll down, click on the next page and start again.

To scrape search results, once we recognize the URL pattern for pagination, we only need two data points: the number of items and items per page. And most of the time, that info is present on the first page or request.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://scrapeme.live/shop/")
soup = BeautifulSoup(response.content, 'html.parser')
pages = soup.select(".woocommerce-pagination a.page-numbers:not(.next)")
print(pages[0].get('href')) # https://scrapeme.live/shop/page/2/
print(pages[-1].get('href')) # https://scrapeme.live/shop/page/48/

One request shows us that there are 48 pages. We can now queue them. Mixing with the other techniques, we would scrape the content from this page and add the remaining 47. To scrape them, we could:

Shuffle the page order to avoid pattern detection
Use different IPs and User-Agent, so each request looks like a new one
Add delays between some of the calls
Use Google as a referrer randomly

We could write some snippet mixing all of these, but the best option in real life is to use a tool with it all like Scrapy, pyspider, node-crawler (Node.js), or Colly (Go). The idea behind the snippets is to understand each problem on its own. But for large-scale, real-life projects, handling everything on our own would be too complicated.
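
Purely to make the list above concrete, here is a small sketch that only plans the requests and leaves the actual fetching (and the proxy rotation) to the caller. The function names build_page_urls and plan_requests, the 3-second maximum delay, and the 30% referrer ratio are assumptions chosen for illustration.

```python
import random
import re


def build_page_urls(last_page_url):
    """Expand '.../page/48/' into the URLs for pages 2..48 (page 1 is already fetched)."""
    match = re.search(r"^(.*/page/)(\d+)/?$", last_page_url)
    if not match:
        raise ValueError(f"unexpected pagination URL: {last_page_url!r}")
    prefix, last = match.group(1), int(match.group(2))
    return [f"{prefix}{number}/" for number in range(2, last + 1)]


def plan_requests(urls, user_agents, rng=random, max_delay=3.0, referer_ratio=0.3):
    """Return shuffled (url, headers, delay) tuples; the caller performs the requests."""
    shuffled = list(urls)
    rng.shuffle(shuffled)  # avoid visiting pages in sequential order
    plan = []
    for url in shuffled:
        headers = {"User-Agent": rng.choice(user_agents)}
        if rng.random() < referer_ratio:  # Google as referrer, only sometimes
            headers["Referer"] = "https://www.google.com/"
        plan.append((url, headers, rng.uniform(0, max_delay)))
    return plan
```

Feeding it pages[-1].get('href') from the previous snippet would yield the 47 remaining pages, each with its own header set and a pause to apply with time.sleep before the call.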

Captchas

Even the best-prepared request can get caught and shown a captcha. Nowadays, solving captchas is achievable - Anti-Captcha and 2Captcha - but a waste of time and money. The best solution is to avoid them. The second best is to forget about that request and retry.

It might sound counterintuitive, but waiting for a second and retrying the same request with a different IP and set of headers will be faster than solving a captcha. Try it yourself and tell us about the experience 😉.
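
A minimal sketch of that retry loop follows. Everything here is an assumption made for the example: looks_blocked is a crude heuristic rather than a standard check, and fetch is a function we would supply ourselves (for instance, a thin wrapper around requests.get) that returns a status code and the body.

```python
import itertools
import time


def looks_blocked(status_code, body):
    """Heuristic, not a standard: treat 403/429 or a 'captcha' mention as a block."""
    return status_code in (403, 429) or "captcha" in body.lower()


def fetch_with_retries(url, fetch, proxies, headers_list,
                       max_attempts=3, wait=1.0, sleep=time.sleep):
    """Call fetch(url, proxy, headers) with a new proxy and header set on each attempt.

    Returns the first unblocked (status_code, body), or None if every attempt
    was blocked.
    """
    proxy_cycle = itertools.cycle(proxies)
    headers_cycle = itertools.cycle(headers_list)
    for attempt in range(max_attempts):
        if attempt:
            sleep(wait)  # brief pause before retrying with a different identity
        status_code, body = fetch(url, next(proxy_cycle), next(headers_cycle))
        if not looks_blocked(status_code, body):
            return status_code, body
    return None
```

Returning None instead of raising lets the caller decide whether to queue the URL again later or drop it.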

Login Wall or Paywall

Some websites prefer to show or redirect users to a login page instead of a captcha. Instagram will redirect anonymous users after a few visits, and Medium will show a paywall.

As with the captchas, a solution can be to tag the IP as "dirty," forget the request, and try again. Libraries usually follow redirects by default but offer an option not to allow them. Ideally, we would only disallow redirects to login, signup, or other specific pages, not all of them. In this example, we will follow redirects until the location contains "accounts/login". In an actual project, we would queue that page with a delay and try again.

import sys
import requests

session = requests.session()
response = session.get('http://instagram.com', allow_redirects=False)
print(response.status_code, response.headers.get('location'))
for redirect in session.resolve_redirects(response, response.request):
    location = redirect.headers.get('location')
    print(redirect.status_code, location)
    if location and "accounts/login" in location:
        sys.exit()  # stop at the login wall; inside a function, a return would be enough
# 301 https://instagram.com/
# 301 https://www.instagram.com/
# 302 https://www.instagram.com/accounts/login/
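
To disallow only the redirects we care about, the check can be pulled into a small helper. This is a sketch under assumptions: is_login_wall and the BLOCKED_PATHS prefixes are hypothetical and would need adjusting for each target site.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes that we treat as a login or signup wall.
BLOCKED_PATHS = ("/accounts/login", "/login", "/signup")
REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_login_wall(status_code, location):
    """True when a redirect response points to one of the blocked paths."""
    if status_code not in REDIRECT_CODES or not location:
        return False
    path = urlparse(location).path
    return any(path.startswith(prefix) for prefix in BLOCKED_PATHS)
```

Inside the loop above, is_login_wall(redirect.status_code, location) could replace the substring test, and a True result would be the signal to queue the page for a later retry.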

Be a good internet citizen

We can use several websites for testing, but be careful when doing the same at scale. Try to be a good internet citizen and do not cause a DDoS, not even a small one. Limit your interactions per domain. Amazon can handle thousands of requests per second, but not all target sites can.
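
One simple way to limit interactions per domain is to enforce a minimum gap between two requests to the same host. The class below is a sketch, not a library API: DomainThrottle and its one-second default are our own choices, and the clock and sleep parameters exist only to make it easy to test.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # domain -> timestamp of the latest request

    def wait(self, url):
        """Sleep if needed so calls for one domain are spaced out; return the pause."""
        domain = urlparse(url).netloc
        now = self.clock()
        previous = self.last_request.get(domain)
        pause = 0.0
        if previous is not None:
            pause = max(0.0, self.min_interval - (now - previous))
        if pause:
            self.sleep(pause)
        self.last_request[domain] = now + pause
        return pause
```

Calling throttle.wait(url) right before each request keeps different domains independent while spacing out calls to the same one.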

We are always talking about "read-only" browsing mode. Access a page and read its contents. Never submit a form or perform active actions with malicious intent.

If we were to take a more active approach, several other factors would matter: writing speed, mouse movement, navigation without clicking, browsing many pages simultaneously, etcetera. Bot prevention software is especially aggressive with active actions, as it should be for security reasons.

We won't go into details on this part, but these actions will give them new reasons to block requests. Again, good citizens don't try massive logins. We are talking about scraping, not malicious activities.

Sometimes websites make data collection harder, maybe not on purpose. With modern frontend tools, CSS classes can change every day, breaking thoroughly prepared scripts. For more details, read our previous entry on how to scrape data in Python.

Conclusion

We'd like you to remember the low-hanging fruit:

IP rotating proxies
Full set headers, including User-Agent
Avoid patterns that might tag you as a bot

There are many more techniques, and probably some we didn't cover. But with these, you should be able to crawl and scrape at scale.

Remember, we covered scraping and avoiding being blocked, but there is much more: crawling, converting and storing the content, scaling the infrastructure, and more. Stay tuned!

Do not forget to take a look at the rest of the posts in this series.

Did you find the content helpful? Please, spread the word and share it. 👈

Originally published at https://www.zenrows.com