Scraping should be about extracting content from HTML. It sounds simple, but in practice there are many obstacles. The first one is obtaining that HTML at all.
You can open a browser, go to a URL, and it's there. Dead simple. We're done.
If you don't need a bigger scale, that's it; you're done. But bear with us if that's not the case and you want to learn how to scrape thousands of URLs without getting blocked.
Websites tend to protect their data and access. There are many possible actions a defensive system could take. We'll start a journey through some of them and learn how to avoid or mitigate their impact.
Note: when testing at scale, never use your home IP directly. A small mistake or slip and you will get banned.
Prerequisites
For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install:
pip install requests beautifulsoup4 pandas
IP Rate Limit
The most basic security system is to ban or throttle requests coming from the same IP. A regular user would not request a hundred pages in a few seconds, so connections that do get tagged as dangerous.
import requests
response = requests.get('http://httpbin.org/ip')
print(response.json()['origin']) # xyz.84.7.83
IP rate limits work similarly to API rate limits, but there is usually no public information about them, so we cannot know for sure how many requests we can make safely.
Our Internet Service Provider assigns us our IP, and we cannot affect or mask it. The solution is to change it. We cannot modify a machine's IP, but we can route our requests through different machines. Datacenter IPs are an option, although that is not a real solution on its own.
Proxies are. They take an incoming request and relay it to the final destination, without any processing along the way. That is enough to mask our IP, since the target website will see the proxy's IP instead of ours.
Rotating Proxies
There are free proxies, even though we do not recommend them. They might work for testing but are not reliable, so we will only use them in a few examples.
Now we have a different IP, and our home connection is safe and sound. Good. But what if they block the proxy's IP? We are back to the initial position.
We won't go into detail about free proxies. Just use the next one on the list. Change them frequently since their lifespan is usually short.
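As a minimal sketch of that rotation, we can cycle through a small list of free proxies and skip any that fails. The addresses below are placeholders and will almost certainly be dead by the time you try them; replace them with fresh ones.

import itertools
import requests

# Placeholder free proxies; their lifespan is short, so refresh the list often
free_proxies = ['http://190.64.18.177:80', 'http://85.214.94.28:3128']
proxy_pool = itertools.cycle(free_proxies)

def get_with_rotation(url, retries=3):
    for _ in range(retries):
        proxy = next(proxy_pool)  # take the next proxy on the list
        try:
            return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        except requests.exceptions.RequestException:
            continue  # dead or banned proxy, move on to the next one
    return None

response = get_with_rotation('http://httpbin.org/ip')
if response:
    print(response.json()['origin'])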
Paid proxy services, on the other hand, offer IP Rotation. Meaning that our service will work the same, but the website will see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they are much harder to ban. And when it happens, we'll get a new IP after a short time.
import requests
proxies = {'http': 'http://190.64.18.177:80'}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print(response.json()['origin']) # 190.64.18.162
If we know about these proxies, so do the defenders. Some big companies will block traffic coming from known proxy IPs or datacenters. For those cases, there is a higher proxy level: residential.
More expensive and sometimes bandwidth-limited, residential proxies offer us IPs that "normal" people use: the same IP our mobile provider might assign us tomorrow, or that a friend had yesterday. They are indistinguishable from actual end users.
So we can scrape whatever we want, right? The cheaper proxies by default, the expensive ones when necessary? Not quite yet. We have only passed the first hurdle, with more to come: we also have to look like a legitimate user to avoid being tagged as a bot or scraper.
User-Agent Header
The next step is to check our request headers. The best known one is User-Agent (UA for short), but there are many more. UA follows a format we'll see later, and many software tools have their own, for example GoogleBot. Here is what the target website will receive if we use Python requests or curl directly.
import requests
response = requests.get('http://httpbin.org/headers')
print(response.json()['headers']['User-Agent'])
# python-requests/2.25.1
curl http://httpbin.org/headers
# { ... "User-Agent": "curl/7.74.0" ... }
Many sites won't check the UA, but it is a huge red flag for the ones that do. We'll have to fake it. Luckily, most libraries allow custom headers. Continuing the example with requests:
import requests
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
response = requests.get('http://httpbin.org/headers', headers=headers)
print(response.json()['headers']['User-Agent'])
# Mozilla/5.0 ...
To get your current UA, visit httpbin - just as the code snippet is doing - and copy it. Requesting all the URLs with the same UA might also trigger some alerts, so the solution is again a bit more complicated.
Ideally, we would have all the current possible User-Agents and rotate them as we did with the IPs. Since that is close to impossible, we can at least have a few. There are lists of User Agents available for us to choose from.
import requests
import random
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Linux; Android 11; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Mobile Safari/537.36'
]
user_agent = random.choice(user_agents)
headers = {'User-Agent': user_agent}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers']['User-Agent'])
# Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) ...
Keep in mind that browsers change versions quite often, and this list can be obsolete in a few months. If we are to use User-Agent rotation, a reliable source is essential. We can do it by hand or use a service provider.
We are a step closer, but there is still a flaw in our headers: anti-bot systems also know this trick and check other headers along with the UA.
Full Set of Headers
Each browser, or even each browser version, sends different headers. Check Chrome and Firefox in action (ignore X-Amzn-Trace-Id):
{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Host": "httpbin.org",
    "Sec-Ch-Ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"",
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-60ff12bb-55defac340ac48081d670f9d"
  }
}

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.5",
    "Host": "httpbin.org",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0",
    "X-Amzn-Trace-Id": "Root=1-60ff12e8-229efca73430280304023fb9"
  }
}
It means what you think it means. The previous array with five User-Agents is incomplete. We need an array with a complete set of headers per User-Agent. For brevity, we will show a list with a single item, which is already long enough.
In this case, copying the result from httpbin is not enough. The ideal would be to copy it directly from the source. The easiest way to do it is from the Firefox or Chrome DevTools - or equivalent in your browser. Go to the Network tab, visit the target website, right-click on the request and copy as cURL. Then convert curl syntax to Python and paste the headers in the list.
import requests
import random
headers_list = [{
    'authority': 'httpbin.org',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'sec-ch-ua-mobile': '?0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
}
    # , {...}
]
headers = random.choice(headers_list)
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers'])
We could add a Referer header for extra security - such as Google or an internal page from the same website. It would mask the fact that we are always requesting URLs directly without any interaction. But be careful since they could also detect fake referrers.
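As a small sketch of that idea, we can pick the referrer at random from a short list; here an external search engine and an internal page of the demo shop used later in this article:

import random
import requests

# Example referrer candidates: a search engine and an internal page of the target site
referers = ['https://www.google.com/', 'https://scrapeme.live/shop/']

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Referer': random.choice(referers),
}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers'].get('Referer'))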
Cookies
We ignored cookies in the headers section above because the best option is usually not to send them at all. Unless we want to log in, we can skip them most of the time. Some websites use cookies to block content or redirect to a login page after a few visits, so by not sending them we can avoid a possible block or login wall.
Other websites will be more tolerant if we perform several actions from the same IP with those session cookies. But it is difficult to say when it will work unless we test it. There is no silver bullet.
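For reference, a plain requests.get() call starts with an empty cookie jar every time, while a Session stores and re-sends whatever cookies the server sets. A quick sketch against httpbin shows the difference and how to drop the cookies if we keep a Session around:

import requests

# A bare call sends no cookies unless we add them explicitly
print(requests.get('https://httpbin.org/cookies').json())  # {'cookies': {}}

# A Session keeps the cookies set by the server and re-sends them
session = requests.Session()
session.get('https://httpbin.org/cookies/set/tracking/abc123')
print(session.get('https://httpbin.org/cookies').json())  # {'cookies': {'tracking': 'abc123'}}

# Clearing the jar lets us reuse the Session without the tracking cookie
session.cookies.clear()
print(session.get('https://httpbin.org/cookies').json())  # {'cookies': {}}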
Anyhow, it might look suspicious if all our requests go without cookies, but allowing them will be worse if we are not extremely careful. And there are no upsides to using session cookies for the initial request. But what happens if we want content generated in the browser after XHR calls?
First, we will need to use a headless browser. Second, not sending cookies will look more than suspicious in these cases: right after the initial load, the JavaScript will try to get some content using an XHR call, and no legitimate user would make that call without the cookies set during the first load.
Headless Browsers
While avoiding them would be preferable for performance reasons, sometimes there is no other choice. Selenium, Puppeteer, and Playwright are the best known and most used libraries. The snippet below prints only the User-Agent, but since a real browser is involved, the headers will include the entire set (Accept, Accept-Encoding, etcetera).
import json
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # p.webkit is also supported, but there is a problem on Linux
    for browser_type in [p.chromium, p.firefox]:
        browser = browser_type.launch()
        page = browser.new_page()
        page.goto('https://httpbin.org/headers')
        jsonContent = json.loads(page.inner_text('pre'))
        print(jsonContent['headers']['User-Agent'])
        browser.close()

# Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/93.0.4576.0 Safari/537.36
# Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0
This approach comes with its own problem: take a look at the User-Agents. The Chromium one includes "HeadlessChrome," which tells the target website, well, that it is a headless browser. They might act upon that.
Back to the headers section: we can add custom headers that will overwrite the default ones. Replace the line in the previous snippet with this one and paste a valid User-Agent:
browser.new_page(extra_http_headers={'User-Agent': '...'})
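Put together, a runnable version might look like the sketch below; the User-Agent string is only an example, so copy a current one from a real browser.

from playwright.sync_api import sync_playwright

# Example UA; replace it with a current one copied from a real browser
custom_ua = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(extra_http_headers={'User-Agent': custom_ua})
    page.goto('https://httpbin.org/headers')
    print(page.inner_text('pre'))  # the echoed User-Agent no longer says "HeadlessChrome"
    browser.close()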
That is just the entry level with headless browsers. Headless detection is a field in itself, and many people are working on it: some to detect headless browsers, some to avoid being detected. As an example, you can visit pixelscan with an actual browser and a headless one. To be deemed "consistent," you'll need to work hard.
See what happens when visiting pixelscan with Playwright: the UA we fake is all right, but they can detect that we are lying by checking the navigator JavaScript API. We could simulate that too (a bit more complicated), but then another check would identify the headless browser. In summary, having 100% coverage is complex, but you won't need it most of the time.
Analyzing the Firefox output from the previous example, it looks legit, almost the same as what we would see using a real browser on a Linux machine. Our advice is not to lie if possible. Faking an iPhone UA is easy; the problems come afterward. You can also fake navigator.userAgent and some other properties, but probably not WebGL, touch events, or battery status.
If you are running a crawler on the cloud, probably a Linux machine, it might look a bit unusual. But it is entirely legit to run Firefox on Linux - we are doing it right now.
Geographic Limits or Geoblocking
Have you ever tried to watch CNN from outside the US?
That's called geoblocking. Only connections from inside the US can watch CNN live. To bypass that, we could use a Virtual Private Network (VPN). We can then browse as usual, but the website will see a local IP thanks to the VPN.
The same can happen when scraping websites with geoblocking. There is an equivalent for proxies: geolocated proxies. Some proxy providers allow us to choose from a list of countries. With that activated, we will only get local IPs from the US, for example.
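Configuration-wise, a geolocated proxy works like any other proxy in requests. The endpoint below is hypothetical; every provider documents its own hostname and credential scheme, often encoding the country in the username or subdomain.

import requests

# Hypothetical geolocated proxy endpoint; check your provider's docs for the real syntax
proxy = 'http://username-country-us:password@proxy.example.com:8080'
proxies = {'http': proxy, 'https': proxy}

response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json()['origin'])  # should now be a US-based IP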
Behavioral patterns
Blocking IPs and User-Agents is not enough these days: those blocklists become unmanageable and stale in hours, if not minutes. As long as we perform requests with "clean" IPs and real-world User-Agents, we are mostly safe. There are more factors involved, but most of our requests should get through.
However, modern anti-bot software uses machine learning and behavioral patterns, not just static markers (IP, UA, geolocation). That means we would be detected if we always performed the same actions in the same order:
- Go to the homepage
- Click on the "Shop" button
- Scroll down
- Go to page 2
- ...
After a few days, launching the same script could result in every request being blocked. Many people can perform those same actions, but bots have something that makes them obvious: speed. With software, we would execute every step sequentially, while an actual user would take a second, then click, scroll down slowly using the mouse wheel, move the mouse to the link and click.
Maybe there is no need to fake all that, but be aware of the possible problems and know how to face them.
We have to think about what we actually want. Maybe we don't need that first request at all if we only require the second page: we could use it as the entry point instead of the homepage and save one request. That approach scales to hundreds of URLs per domain. There is no need to visit every page in order, scroll down, click on the next page, and start again.
To scrape search results, once we recognize the URL pattern for pagination, we only need two data points: the number of items and items per page. And most of the time, that info is present on the first page or request.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://scrapeme.live/shop/")
soup = BeautifulSoup(response.content, 'html.parser')
pages = soup.select(".woocommerce-pagination a.page-numbers:not(.next)")
print(pages[0].get('href')) # https://scrapeme.live/shop/page/2/
print(pages[-1].get('href')) # https://scrapeme.live/shop/page/48/
One request shows us that there are 48 pages. We can now queue them. Mixing with the other techniques, we would scrape the content from this page and add the remaining 47. To scrape them, we could:
- Shuffle the page order to avoid pattern detection
- Use different IPs and User-Agents, so each request looks like a new one
- Add delays between some of the calls
- Use Google as a referrer randomly
We could write a snippet mixing all of these, but the best option in real life is to use a tool that has it all, like Scrapy, pyspider, node-crawler (Node.js), or Colly (Go). The idea behind these snippets is to understand each problem on its own; for large-scale, real-life projects, handling everything on our own would be too complicated.
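Still, a minimal sketch mixing a few of those points shows how the pieces fit together. The User-Agent and proxy pools below are placeholders to be filled with real values from the previous sections.

import random
import time
import requests

# Placeholder pools; fill them with real User-Agents and working proxies
user_agents = ['Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36']
proxies_pool = [None]  # e.g. {'http': 'http://190.64.18.177:80'}

# Shuffle the pagination URLs discovered in the previous snippet
pages = [f'https://scrapeme.live/shop/page/{i}/' for i in range(2, 49)]
random.shuffle(pages)

for url in pages:
    headers = {'User-Agent': random.choice(user_agents)}
    if random.random() < 0.3:
        headers['Referer'] = 'https://www.google.com/'  # occasionally pretend we came from Google
    response = requests.get(url, headers=headers, proxies=random.choice(proxies_pool))
    print(url, response.status_code)
    time.sleep(random.uniform(1, 5))  # random delay between calls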
Captchas
Even the best-prepared request can get caught and shown a captcha. Nowadays, solving captchas is achievable with services like Anti-Captcha and 2Captcha, but it is a waste of time and money. The best solution is to avoid them. The second best is to forget about that request and retry.
It might sound counterintuitive, but waiting for a second and retrying the same request with a different IP and set of headers will be faster than solving a captcha. Try it yourself and tell us about the experience.
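A simplified retry helper, assuming that a 200 response without a captcha marker in the body counts as success (real detection depends on the target site), could look like this; the header and proxy pools are placeholders for the ones built earlier.

import random
import time
import requests

# Placeholder pools; reuse the header sets and proxies from the previous sections
headers_pool = [{'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}]
proxies_pool = [None]  # e.g. {'http': 'http://190.64.18.177:80'}

def get_avoiding_captcha(url, max_retries=3):
    for _ in range(max_retries):
        response = requests.get(url,
                                headers=random.choice(headers_pool),
                                proxies=random.choice(proxies_pool))
        # Discard suspicious responses and retry with a different IP and header set
        if response.status_code == 200 and 'captcha' not in response.text.lower():
            return response
        time.sleep(1)
    return None

print(get_avoiding_captcha('https://httpbin.org/html'))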
Login Wall or Paywall
Some websites prefer to show or redirect users to a login page instead of a captcha. Instagram will redirect anonymous users after a few visits, and Medium will show a paywall.
As with the captchas, a solution can be to tag the IP as "dirty," forget the request, and try again. Libraries usually follow redirects by default but offer an option not to. Ideally, we would only disallow redirects to login, signup, or other specific pages, not all of them. In this example, we will follow redirects until the location contains "accounts/login." In an actual project, we would queue that page with a delay and try again.
import sys
import requests

session = requests.session()
response = session.get('http://instagram.com', allow_redirects=False)
print(response.status_code, response.headers.get('location'))

for redirect in session.resolve_redirects(response, response.request):
    location = redirect.headers.get('location')
    print(redirect.status_code, location)
    if location and "accounts/login" in location:
        sys.exit()  # no need to exit, return would be enough

# 301 https://instagram.com/
# 301 https://www.instagram.com/
# 302 https://www.instagram.com/accounts/login/
Be a good internet citizen
We can use several websites for testing, but be careful when doing the same at scale. Try to be a good internet citizen and do not cause even a small DDoS. Limit your interactions per domain: Amazon can handle thousands of requests per second, but not every target site can.
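One simple way to limit interactions per domain, sketched here for a single-threaded crawler, is to enforce a minimum delay between requests to the same host:

import time
from urllib.parse import urlparse

MIN_DELAY = 2  # seconds between requests to the same domain
last_request = {}

def wait_for_domain(url):
    # Sleep just enough so we never hit the same domain faster than MIN_DELAY
    domain = urlparse(url).netloc
    elapsed = time.time() - last_request.get(domain, 0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_request[domain] = time.time()

wait_for_domain('https://scrapeme.live/shop/page/2/')  # first call returns immediately
wait_for_domain('https://scrapeme.live/shop/page/3/')  # this one waits about two seconds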
We are always talking about "read-only" browsing mode. Access a page and read its contents. Never submit a form or perform active actions with malicious intent.
If we were to take a more active approach, several other factors would matter: typing speed, mouse movement, navigation without clicking, browsing many pages simultaneously, etcetera. Bot prevention software is especially aggressive with active actions, as it should be for security reasons.
We won't go into details on this part, but these actions will give them new reasons to block requests. Again, good citizens don't try massive logins. We are talking about scraping, not malicious activities.
Sometimes websites make data collection harder, maybe not on purpose. But with modern frontend tools, CSS classes can change every day, breaking thoroughly prepared scripts. For more details, read our previous entry on how to scrape data in Python.
Conclusion
We'd like you to remember the low-hanging fruit:
- IP-rotating proxies
- Full sets of headers, including User-Agent
- Avoiding patterns that might tag you as a bot
There are many more, and probably more we didn't cover. But with these techniques, you should be able to crawl and scrape at scale.
Contact us if you know any more website scraping tricks or have doubts about applying them.
Remember, we covered scraping and avoiding being blocked, but there is much more: crawling, converting and storing the content, scaling the infrastructure, and more. Stay tuned!
Do not forget to take a look at the rest of the posts in this series.
- Scaling to Distributed Crawling (4/4)
- Crawling from Scratch (3/4)
- Mastering Extraction (1/4)
Did you find the content helpful? Please, spread the word and share it.
Originally published at https://www.zenrows.com